Web Scraping in R

Haley Jeppson, Sam Tyner

June 15, 2017

A web of data

  • In 2008, an estimated 154 million HTML tables (out of 14.1 billion) contained ‘high quality relational data’!
  • Hard to quantify how much more exists outside of HTML Tables, but there is an estimate of at least 30 million lists with ‘high quality relational data’.
  • A growing number of websites/companies provide programmatic access to their data/services via web APIs (that data typically comes in XML/JSON format).
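
To make the API point concrete, here is a minimal sketch of parsing a JSON response into a data frame. The JSON string is made up for illustration, and using the jsonlite package is an assumption (the slides do not name a JSON parser).

```r
library(jsonlite)

# Hypothetical JSON payload, shaped like what a web API might return
api_response <- '[
  {"country": "Qatar", "gdp_per_capita": 127660},
  {"country": "Luxembourg", "gdp_per_capita": 104003}
]'

# fromJSON() simplifies an array of records into a data frame
gdp <- fromJSON(api_response)
gdp
```

When an API hands back records like these, the result is often already close to tidy: one row per observation, one column per variable.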

Before scraping, do some googling!

A web of messy data!

  • Recall the concept of tidy data.
  • Data is in a table where
    • 1 row == 1 observation
    • 1 column == 1 variable (observational attribute)
  • Parsing web data (HTML/XML/JSON) is easy (for computers)
  • Getting it in a tidy form is typically not easy.
  • Knowing a bit about modern tools & web technologies makes it much easier.

What is webscraping?

  • Extract data from websites
    • Tables
    • Links to other websites
    • Text

Why webscrape?

  • Because copy-paste is awful
  • Because it’s fast
  • Because you can automate it

Don’t abuse your power

  • Before scraping a website, please read the terms and conditions!!
  • It’s sometimes more efficient/appropriate to find the API.
  • If a website publicly offers an API, USE IT (instead of scraping)!!! (more on this later)

http://www.wired.com/2014/01/how-to-hack-okcupid

http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release

Webscraping with rvest: Step-by-Step Start Guide

Step 1: Find a URL

What data do you want?

  • Information on the Oscar-winning film Moonlight

Find it on the web!

# character variable containing the url you want to scrape
myurl <- "http://www.imdb.com/title/tt4975722/"

Step 2: Read HTML into R

“Huh? What am I doing?” - some of you right now

  • HTML is HyperText Markup Language. All webpages are written with it.
  • Go to any website, right click, click “View Page Source” to see the HTML
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n<script>\n    if (typeof ue ...

Step 3: Figure out where your data is

Need to find your data within the myhtml object.

Tags to look for:

  • <p>: paragraphs
  • <h1>, <h2>, etc.: headers
  • <a>: links
  • <li>: item in a list
  • <table>: tables
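
A minimal sketch of how these tags map to selectors, using a tiny made-up page rather than a live site. It relies on read_html() accepting a literal HTML string in place of a URL; the page content is purely illustrative.

```r
library(rvest)

# A tiny made-up page; read_html() accepts literal HTML as well as a URL
toy_page <- read_html('
<html><body>
  <h1>Moonlight</h1>
  <p>A short summary with a <a href="/cast">link to the cast</a>.</p>
  <ul><li>Drama</li><li>2016</li></ul>
</body></html>')

header_text <- toy_page %>% html_nodes("h1") %>% html_text()      # header text
list_items  <- toy_page %>% html_nodes("li") %>% html_text()      # one string per list item
link_target <- toy_page %>% html_nodes("a") %>% html_attr("href") # where the link points

header_text
list_items
link_target
```

Experimenting on a small string like this is a quick way to check a selector before pointing it at a real page.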

Use SelectorGadget to find the exact location. (Demo)

For more on HTML, I recommend W3schools’ tutorial

  • You don’t need to be an expert in HTML to webscrape with rvest!

Step 4: Tell rvest where to find your data

Pass the CSS selector copied from SelectorGadget, or an HTML tag name, to html_nodes() to extract your data of interest.

myhtml %>% html_nodes(".summary_text") %>% html_text()
## [1] "\n                    A chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami.\n            "
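
The summary text above comes back padded with newlines and spaces. One way to handle that at extraction time is the trim argument of html_text(). The snippet below is a made-up stand-in for the IMDb page, so treat it as a sketch only.

```r
library(rvest)

# Made-up snippet that mimics the padded summary text
snippet <- read_html('<div class="summary_text">
      A chronicle of childhood.
  </div>')

raw_text <- snippet %>% html_nodes(".summary_text") %>% html_text()
trimmed  <- snippet %>% html_nodes(".summary_text") %>% html_text(trim = TRUE)

raw_text  # still has the leading and trailing whitespace
trimmed   # whitespace at both ends removed
```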
myhtml %>% html_nodes("table") %>% html_table(header = TRUE)
## [[1]]
##    Cast overview, first billed only: Cast overview, first billed only:
## 1                                 NA                    Mahershala Ali
## 2                                 NA                      Shariff Earp
## 3                                 NA                    Duan Sanderson
## 4                                 NA                   Alex R. Hibbert
## 5                                 NA                     Janelle Monáe
## 6                                 NA                     Naomie Harris
## 7                                 NA                       Jaden Piner
## 8                                 NA            Herman 'Caheei McGloun
## 9                                 NA                  Kamal Ani-Bellow
## 10                                NA                      Keomi Givens
## 11                                NA                   Eddie Blanchard
## 12                                NA                       Rudi Goblen
## 13                                NA                    Ashton Sanders
## 14                                NA                        Edson Jean
## 15                                NA                    Patrick Decile
##    Cast overview, first billed only:
## 1                                ...
## 2                                ...
## 3                                ...
## 4                                ...
## 5                                ...
## 6                                ...
## 7                                ...
## 8                                ...
## 9                                ...
## 10                               ...
## 11                               ...
## 12                               ...
## 13                               ...
## 14                               ...
## 15                               ...
##                        Cast overview, first billed only:
## 1                                                   Juan
## 2                                               Terrence
## 3            Azu \n  \n  \n  (as Duan 'Sandy' Sanderson)
## 4                   Little \n  \n  \n  (as Alex Hibbert)
## 5                                                 Teresa
## 6                                                  Paula
## 7                                            Kevin age 9
## 8  Longshoreman \n  \n  \n  (as Herman 'Caheej' McCloun)
## 9                                         Portable Boy 1
## 10                                        Portable Boy 2
## 11                                        Portable Boy 3
## 12                                                   Gee
## 13                                                Chiron
## 14                                            Mr. Pierce
## 15                                                Terrel
## 
## [[2]]
##                     Amazon Affiliates
## 1 Amazon VideoWatch Movies &TV Online
##                              Amazon Affiliates
## 1 Prime VideoUnlimited Streamingof Movies & TV
##                          Amazon Affiliates
## 1 Amazon GermanyBuy Movies onDVD & Blu-ray
##                        Amazon Affiliates
## 1 Amazon ItalyBuy Movies onDVD & Blu-ray
##                         Amazon Affiliates
## 1 Amazon FranceBuy Movies onDVD & Blu-ray
##                       Amazon Affiliates          Amazon Affiliates
## 1 Amazon IndiaBuy Movie andTV Show DVDs DPReviewDigitalPhotography
##            Amazon Affiliates
## 1 AudibleDownloadAudio Books

Step 5: Save & tidy data

library(stringr)
library(magrittr)
mydat <- myhtml %>% 
  html_nodes("table") %>%
  extract2(1) %>%                # the first table on the page is the cast list
  html_table(header = TRUE)
mydat <- mydat[,c(2,4)]          # keep only the actor and role columns
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>% 
  mutate(Role = str_replace_all(Role, "\n  ", ""))  # strip the line breaks and padding
mydat
##                     Actor                                      Role
## 1          Mahershala Ali                                      Juan
## 2            Shariff Earp                                  Terrence
## 3          Duan Sanderson           Azu (as Duan 'Sandy' Sanderson)
## 4         Alex R. Hibbert                  Little (as Alex Hibbert)
## 5           Janelle Monáe                                    Teresa
## 6           Naomie Harris                                     Paula
## 7             Jaden Piner                               Kevin age 9
## 8  Herman 'Caheei McGloun Longshoreman (as Herman 'Caheej' McCloun)
## 9        Kamal Ani-Bellow                            Portable Boy 1
## 10           Keomi Givens                            Portable Boy 2
## 11        Eddie Blanchard                            Portable Boy 3
## 12            Rudi Goblen                                       Gee
## 13         Ashton Sanders                                    Chiron
## 14             Edson Jean                                Mr. Pierce
## 15         Patrick Decile                                    Terrel
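
As an alternative sketch for the whitespace cleanup, str_squish() from stringr (available in stringr 1.3.0 and later) trims both ends and collapses any run of whitespace to a single space, so the exact pattern of newlines and spaces does not need to be known. The role strings are copied from the scraped table above.

```r
library(stringr)

# Role strings copied from the scraped cast table
roles <- c("Juan",
           "Azu \n  \n  \n  (as Duan 'Sandy' Sanderson)",
           "Little \n  \n  \n  (as Alex Hibbert)")

# str_squish() trims both ends and collapses inner whitespace runs to one space
clean_roles <- str_squish(roles)
clean_roles
```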

Your Turn #1

Using rvest, scrape a table from Wikipedia. You can pick your own table or you can get one of the tables in the country GDP per capita example from earlier.

Your result should be a data frame with one observation per row and one variable per column.

Your Turn #1 Solution

library(tidyverse)  # provides mutate() and parse_number()
library(rvest)
library(magrittr)
myurl <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
myhtml <- read_html(myurl)
myhtml %>% 
 html_nodes("table") %>%
 extract2(2) %>%
 html_table(header = TRUE) %>% 
 mutate(`Int$` = parse_number(`Int$`)) %>% 
 head()
##   Rank    Country   Int$
## 1    1      Qatar 127660
## 2    2 Luxembourg 104003
## 3    —      Macau  95151
## 4    3  Singapore  87855
## 5    4     Brunei  76884
## 6    5     Kuwait  71887

Deeper dive into rvest

Key Functions: html_nodes

  • html_nodes(x, "path") extracts all elements from the page x that match the CSS selector path (a tag, class, or id). (Use SelectorGadget to determine path.)
  • html_node() does the same thing but returns only the first matching element.
  • Can be chained
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") # then get all the links in those paragraphs
## {xml_nodeset (24)}
##  [1] <a href="/wiki/Purchasing_power_parity" title="Purchasing power par ...
##  [2] <a href="/wiki/Goods_and_services" title="Goods and services">goods ...
##  [3] <a href="/wiki/Gross_domestic_product" title="Gross domestic produc ...
##  [4] <a href="/wiki/Per_capita" title="Per capita">per capita</a>
##  [5] <a href="/wiki/International_Monetary_Fund" title="International Mo ...
##  [6] <a href="/wiki/World_Bank" title="World Bank">World Bank</a>
##  [7] <a href="/wiki/National_wealth" title="National wealth">national we ...
##  [8] <a href="/wiki/Savings" class="mw-redirect" title="Savings">savings ...
##  [9] <a href="/wiki/Cost_of_living" title="Cost of living">cost of livin ...
## [10] <a href="/wiki/List_of_countries_by_GDP_(nominal)_per_capita" title ...
## [11] <a href="https://en.wiktionary.org/wiki/generalized" class="extiw"  ...
## [12] <a href="/wiki/Living_standards" class="mw-redirect" title="Living  ...
## [13] <a href="/wiki/Inflation_rates" class="mw-redirect" title="Inflatio ...
## [14] <a href="/wiki/Exchange_rates" class="mw-redirect" title="Exchange  ...
## [15] <a href="#cite_note-2">[2]</a>
## [16] <a href="#cite_note-3">[3]</a>
## [17] <a href="/wiki/Personal_income" title="Personal income">personal in ...
## [18] <a href="/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_W ...
## [19] <a href="/wiki/Geary%E2%80%93Khamis_dollar" title="Geary–Khamis dol ...
## [20] <a href="/wiki/Rounding" title="Rounding">rounded</a>
## ...
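
A small sketch of the difference between html_nodes(), chained html_nodes(), and html_node(), on a made-up page so the counts are easy to verify by eye.

```r
library(rvest)

# Made-up page: two links sit inside a <p>, one sits inside a <div>
toy_page <- read_html('
<html><body>
  <p>See <a href="/a">first</a> and <a href="/b">second</a>.</p>
  <p>No links here.</p>
  <div><a href="/c">third</a></div>
</body></html>')

all_links  <- toy_page %>% html_nodes("a")                      # every <a> on the page
p_links    <- toy_page %>% html_nodes("p") %>% html_nodes("a")  # only <a> inside <p>
first_link <- toy_page %>% html_node("a")                       # first match only

length(all_links)
length(p_links)
html_text(first_link)
```

The chained call selects the same links as the single descendant selector html_nodes("p a"); chaining is mostly a readability choice.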

Key Functions: html_text

  • html_text(x) extracts all text from the nodeset x
  • Good for cleaning output
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs 
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only 
##  [1] "purchasing power parity"                      
##  [2] "goods and services"                           
##  [3] "gross domestic product"                       
##  [4] "per capita"                                   
##  [5] "International Monetary Fund"                  
##  [6] "World Bank"                                   
##  [7] "national wealth"                              
##  [8] "savings"                                      
##  [9] "cost of living"                               
## [10] "List of countries by GDP (nominal) per capita"
## [11] "generalized"                                  
## [12] "living standards"                             
## [13] "inflation rates"                              
## [14] "exchange rates"                               
## [15] "[2]"                                          
## [16] "[3]"                                          
## [17] "personal income"                              
## [18] "Standard of living and GDP"                   
## [19] "Geary–Khamis dollars"                         
## [20] "rounded"                                      
## [21] "whole number"                                 
## [22] "economies"                                    
## [23] "sovereign states"                             
## [24] "dependent territories"

Key Functions: html_table

  • html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
  • Structure of HTML makes finding and extracting tables easy!
myhtml %>% 
  html_nodes("table") %>% # get the tables 
  head(2) # look at first 2
## {xml_nodeset (2)}
## [1] <table style="font-size:95%;">\n<tr>\n<td width="30%" align="center" ...
## [2] <table class="wikitable sortable" style="margin-left:auto;margin-rig ...
myhtml %>% 
  html_nodes("table") %>% # get the tables 
  extract2(2) %>% # pick the second one to parse
  html_table(header = TRUE) # parse table 
##     Rank                          Country    Int$
## 1      1                            Qatar 127,660
## 2      2                       Luxembourg 104,003
## 3      —                            Macau  95,151
## 4      3                        Singapore  87,855
## 5      4                           Brunei  76,884
## 6      5                           Kuwait  71,887
## 7      6                           Norway  69,249
## 8      7                          Ireland  69,231
## 9      8             United Arab Emirates  67,871
## 10     9                      Switzerland  59,561
## 11    10                       San Marino  59,058
## 12     —                        Hong Kong  58,322
## 13    11                    United States  57,436
## 14    12                     Saudi Arabia  55,158
## 15    13                      Netherlands  51,049
## 16    14                          Bahrain  50,704
## 17    15                           Sweden  49,836
## 18    16                          Iceland  49,136
## 19    17                        Australia  48,899
## 20    18                          Germany  48,111
## 21     —                           Taiwan  48,095
## 22    19                          Austria  48,005
## 23    20                          Denmark  47,985
## 24    21                             Oman  46,698
## 25    22                           Canada  46,437
## 26    23                          Belgium  45,047
## 27    24                   United Kingdom  42,481
## 28    25                           France  42,314
## 29    26                          Finland  42,165
## 30    27                            Japan  41,275
## 31    28                            Malta  39,834
## 32    29                Equatorial Guinea  38,639
## 33     —                      Puerto Rico  38,393
## 34    30                      South Korea  37,740
## 35    31                      New Zealand  37,294
## 36    32                            Italy  36,833
## 37    33                            Spain  36,416
## 38    34                           Israel  35,179
## 39    35                           Cyprus  34,970
## 40    36                   Czech Republic  33,232
## 41    37                         Slovenia  32,085
## 42    38              Trinidad and Tobago  31,870
## 43    39                         Slovakia  31,339
## 44    40                        Lithuania  29,972
## 45    41                          Estonia  29,313
## 46    42                         Portugal  28,933
## 47    43                           Poland  27,764
## 48    44                       Seychelles  27,602
## 49    45                          Hungary  27,482
## 50    46                         Malaysia  27,267
## 51    47                           Greece  26,669
## 52    48                           Russia  26,490
## 53    49            Saint Kitts and Nevis  25,940
## 54    50                           Latvia  25,710
## 55    51              Antigua and Barbuda  25,157
## 56    52                       Kazakhstan  25,145
## 57    53                           Turkey  24,912
## 58    54                     Bahamas, The  24,555
## 59    55                            Chile  24,113
## 60    56                           Panama  23,024
## 61    57                          Croatia  22,795
## 62    58                          Romania  22,348
## 63    59                          Uruguay  21,527
## 64    60                        Mauritius  20,422
## 65    61                         Bulgaria  20,327
## 66    62                        Argentina  20,047
## 67    63                            Gabon  19,056
## 68    64                           Mexico  18,938
## 69    65                          Lebanon  18,525
## 70    66                             Iran  18,077
## 71    67                          Belarus  18,000
## 72    68                             Iraq  17,944
## 73    69                     Turkmenistan  17,485
## 74    70                       Azerbaijan  17,439
## 75    71                         Barbados  17,100
## 76    72                         Botswana  17,042
## 77    73                         Thailand  16,888
## 78    74                       Montenegro  16,643
## 79    75                       Costa Rica  16,436
## 80     —                      World[7][8]  16,318
## 81    76               Dominican Republic  16,049
## 82    77                         Maldives  15,553
## 83    78                            China  15,399
## 84    79                            Palau  15,319
## 85    80                           Brazil  15,242
## 86    81                          Algeria  15,026
## 87    82                        Macedonia  14,597
## 88    83                           Serbia  14,493
## 89    84                         Colombia  14,130
## 90    85                          Grenada  14,116
## 91    86                         Suriname  13,988
## 92    87                        Venezuela  13,761
## 93    88                     South Africa  13,225
## 94    89                             Peru  12,903
## 95    90                            Egypt  12,554
## 96    91                           Jordan  12,278
## 97    92                         Mongolia  12,275
## 98    93                        Sri Lanka  12,262
## 99    94                          Albania  11,840
## 100   95                      Saint Lucia  11,783
## 101   96                        Indonesia  11,720
## 102   97                          Tunisia  11,634
## 103   98                            Nauru  11,539
## 104   99                         Dominica  11,375
## 105  100                          Namibia  11,290
## 106  101 Saint Vincent and the Grenadines  11,271
## 107  102                          Ecuador  11,109
## 108  103           Bosnia and Herzegovina  10,958
## 109    —                    Kosovo[9][10]  10,235
## 110  104                          Georgia  10,044
## 111  105                        Swaziland   9,776
## 112  106                         Paraguay   9,396
## 113  107                             Fiji   9,268
## 114  108                          Jamaica   8,976
## 115  109                      El Salvador   8,909
## 116  110                            Libya   8,678
## 117  111                          Armenia   8,621
## 118  112                          Morocco   8,330
## 119  113                          Ukraine   8,305
## 120  114                           Bhutan   8,227
## 121  115                           Belize   8,220
## 122  116                        Guatemala   7,899
## 123  117                           Guyana   7,873
## 124  118                      Philippines   7,728
## 125  119                          Bolivia   7,218
## 126  120                           Angola   6,844
## 127  121                      Congo, Rep.   6,676
## 128  122                       Cape Verde   6,662
## 129  123                            India   6,616
## 130  124                       Uzbekistan   6,563
## 131  125                          Vietnam   6,429
## 132  126                          Nigeria   5,942
## 133  127                          Myanmar   5,832
## 134  128                             Laos   5,710
## 135  129                            Samoa   5,553
## 136  130                        Nicaragua   5,452
## 137  131                            Tonga   5,386
## 138  132                          Moldova   5,328
## 139  133                         Honduras   5,271
## 140  134                         Pakistan   4,906
## 141  135                            Sudan   4,447
## 142  136                            Ghana   4,412
## 143  137                       Mauritania   4,328
## 144  138                      Timor-Leste   4,187
## 145  139                       Bangladesh   3,891
## 146  140                           Zambia   3,880
## 147  141                         Cambodia   3,737
## 148  142                    Côte d'Ivoire   3,609
## 149  143                          Lesotho   3,601
## 150  144                           Tuvalu   3,567
## 151  145                 Papua New Guinea   3,541
## 152  146                       Kyrgyzstan   3,521
## 153  147                         Djibouti   3,370
## 154  148                            Kenya   3,361
## 155  149                 Marshall Islands   3,301
## 156  150                         Cameroon   3,249
## 157  151                       Micronesia   3,234
## 158  152                         Tanzania   3,080
## 159  153            São Tomé and Príncipe   3,072
## 160  154                       Tajikistan   3,008
## 161  155                          Vanuatu   2,631
## 162  156                          Senegal   2,577
## 163  157                            Nepal   2,479
## 164  158                             Chad   2,445
## 165  159                            Yemen   2,375
## 166  160                             Mali   2,266
## 167  161                            Benin   2,119
## 168  162                           Uganda   2,068
## 169  163                           Rwanda   1,977
## 170  164                  Solomon Islands   1,973
## 171  165                         Zimbabwe   1,970
## 172  166                         Ethiopia   1,946
## 173  167                      Afghanistan   1,919
## 174  168                         Kiribati   1,823
## 175  169                            Haiti   1,784
## 176  170                     Burkina Faso   1,782
## 177  171                    Guinea-Bissau   1,730
## 178  172                     Sierra Leone   1,672
## 179  173                      Gambia, The   1,667
## 180  174                      South Sudan   1,657
## 181  175                             Togo   1,550
## 182  176                          Comoros   1,529
## 183  177                       Madagascar   1,505
## 184  178                          Eritrea   1,410
## 185  179                           Guinea   1,265
## 186  180                       Mozambique   1,215
## 187  181                           Malawi   1,134
## 188  182                            Niger   1,107
## 189  183                          Liberia     855
## 190  184                          Burundi     814
## 191  185                 Congo, Dem. Rep.     773
## 192  186         Central African Republic     652
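
Note that the Int$ column above still contains commas, so it is text rather than numbers. A minimal sketch of the parse-then-convert step, using a made-up two-row table with the same headers and assuming the readr package for parse_number():

```r
library(rvest)
library(readr)

# Made-up table reusing the first two rows shown above
toy_table <- read_html('
<table>
  <tr><th>Rank</th><th>Country</th><th>Int$</th></tr>
  <tr><td>1</td><td>Qatar</td><td>127,660</td></tr>
  <tr><td>2</td><td>Luxembourg</td><td>104,003</td></tr>
</table>')

gdp <- toy_table %>%
  html_node("table") %>%
  html_table(header = TRUE)

# "127,660" is read as text; parse_number() drops the comma and returns a number
gdp$`Int$` <- parse_number(as.character(gdp$`Int$`))
gdp
```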

Key functions: html_attrs

  • html_attrs(x) - extracts all attributes (names and values) from each element of a nodeset x
  • html_attr(x, name) - extracts the value of the attribute called name from every element in nodeset x
  • Attributes are things in the HTML like href, title, class, style, etc.
  • Use these functions to find and extract your data
myhtml %>% 
  html_nodes("table") %>% extract2(2) %>%
  html_attrs()
##                                                  class 
##                                   "wikitable sortable" 
##                                                  style 
## "margin-left:auto;margin-right:auto;text-align: right"
myhtml %>% 
  html_nodes("p") %>% html_nodes("a") %>%
  html_attr("href")
##  [1] "/wiki/Purchasing_power_parity"                                                                 
##  [2] "/wiki/Goods_and_services"                                                                      
##  [3] "/wiki/Gross_domestic_product"                                                                  
##  [4] "/wiki/Per_capita"                                                                              
##  [5] "/wiki/International_Monetary_Fund"                                                             
##  [6] "/wiki/World_Bank"                                                                              
##  [7] "/wiki/National_wealth"                                                                         
##  [8] "/wiki/Savings"                                                                                 
##  [9] "/wiki/Cost_of_living"                                                                          
## [10] "/wiki/List_of_countries_by_GDP_(nominal)_per_capita"                                           
## [11] "https://en.wiktionary.org/wiki/generalized"                                                    
## [12] "/wiki/Living_standards"                                                                        
## [13] "/wiki/Inflation_rates"                                                                         
## [14] "/wiki/Exchange_rates"                                                                          
## [15] "#cite_note-2"                                                                                  
## [16] "#cite_note-3"                                                                                  
## [17] "/wiki/Personal_income"                                                                         
## [18] "/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_Wealth_distribution_and_externalities"
## [19] "/wiki/Geary%E2%80%93Khamis_dollar"                                                             
## [20] "/wiki/Rounding"                                                                                
## [21] "/wiki/Integer"                                                                                 
## [22] "/wiki/Economy"                                                                                 
## [23] "/wiki/Sovereign_state"                                                                         
## [24] "/wiki/Dependent_territories"
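
Most of the href values above are relative to the site, so they cannot be read directly. The sketch below pairs link text with its target in a data frame and resolves relative links with xml2::url_absolute() (xml2 is installed alongside rvest). The paragraph is made up, and using the site root as the base URL is an assumption for illustration.

```r
library(rvest)

# Made-up paragraph with links like those listed above
toy_page <- read_html('
<p>GDP at <a href="/wiki/Purchasing_power_parity">purchasing power parity</a>,
reported by the <a href="/wiki/World_Bank">World Bank</a>.</p>')

links <- toy_page %>% html_nodes("p") %>% html_nodes("a")

# One row per link: the visible text and the raw href attribute
link_df <- data.frame(
  text = html_text(links),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Relative links need a base URL before they can be followed
base_url <- "https://en.wikipedia.org/"
link_df$url <- xml2::url_absolute(link_df$href, base_url)
link_df
```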

Other functions

  • html_children - list the “children” of the HTML page. Can be chained like html_nodes
  • html_name - gives the tags of a nodeset. Use in a chain with html_children
myhtml %>% 
  html_children() %>% 
  html_name()
## [1] "head" "body"
  • html_form - parses HTML forms (checkboxes, fill-in-the-blanks, etc.)
  • html_session - simulates a browser session; use the functions jump_to() and back() to navigate between pages

Your Turn #2

Find another website you want to scrape (ideas: all bills in the House so far this year, video game reviews, anything on Wikipedia) and use at least 3 different rvest functions in a chain to extract some data.

Advanced Example: Inaugural Addresses

The Data

  • The Avalon Project has most of the U.S. Presidential inaugural addresses.
  • Obama’s (2013), Trump’s (2017), Van Buren’s (1837), Buchanan’s (1857), Garfield’s (1881), and Coolidge’s (1925) addresses are missing, but are easily found elsewhere. I have them saved as text files on the website.
  • Let’s scrape all of them from The Avalon Project!

Get data frame of addresses

  • We could get the presidents’ names and inauguration years from another source, but we’ll use The Avalon Project’s site because it’s a good example of data that needs tidying.
url <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
# even though it's called "all inaugs" some are missing
all_inaugs <- url %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE, header = TRUE) %>% 
  extract2(3)
# tidy table of addresses
all_inaugs_tidy <- all_inaugs %>% 
  gather(term, year, -President) %>% 
  filter(!is.na(year)) %>% 
  select(-term) %>% 
  arrange(year)
head(all_inaugs_tidy)
##           President year
## 1 George Washington 1789
## 2 George Washington 1793
## 3        John Adams 1797
## 4  Thomas Jefferson 1801
## 5  Thomas Jefferson 1805
## 6     James Madison 1809
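
To see what the gather() step is doing, here is the same pipeline on a made-up wide table: one row per president and one column per term. The column names First and Second are hypothetical; the real scraped table may use different ones.

```r
library(tidyr)
library(dplyr)

# Made-up wide table in the same shape as the scraped one
wide <- data.frame(
  President = c("George Washington", "John Adams"),
  First     = c(1789, 1797),
  Second    = c(1793, NA),
  stringsAsFactors = FALSE
)

tidy <- wide %>%
  gather(term, year, -President) %>%  # stack the term columns into key/value pairs
  filter(!is.na(year)) %>%            # drop terms that never happened
  select(-term) %>%                   # the term label is no longer needed
  arrange(year)
tidy
```

After gathering, each row is one inauguration, which is the unit we want to scrape.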

Automate scraping

  • A function to read the addresses and get the text of the speeches, with a catch for a read error
get_inaugurations <- function(url){
  # try to read the page; return NA if the request fails
  page <- try(read_html(url), silent = TRUE)
  if (inherits(page, "try-error")) {
    return(NA)
  }
  # reuse the parsed page instead of downloading it a second time
  page %>%
    html_nodes("p") %>% 
    html_text()
}
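
An alternative sketch of the same error-catching idea uses purrr::possibly(), which wraps a function so that an error returns a fallback value. The helper name get_paragraphs and its pause argument are hypothetical. Literal HTML stands in for a working URL and a missing file stands in for a broken one, so the example runs offline.

```r
library(rvest)
library(purrr)

# possibly() wraps a function so that any error yields `otherwise` instead
safe_read <- possibly(read_html, otherwise = NA)

# Hypothetical helper: same idea as get_inaugurations(), reading each page once
get_paragraphs <- function(src, pause = 0) {
  Sys.sleep(pause)  # be polite: wait between requests when scraping a real site
  page <- safe_read(src)
  if (!inherits(page, "xml_document")) return(NA_character_)
  page %>% html_nodes("p") %>% html_text()
}

# Literal HTML stands in for a good URL; the missing file stands in for a bad one
sources <- c("<html><body><p>Fellow-Citizens</p><p>Among the vicissitudes</p></body></html>",
             "no_such_page.html")
results <- map(sources, get_paragraphs)
results
```

With real URLs, a call such as map(urls, get_paragraphs, pause = 1) would space the requests one second apart.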

# takes about 30 secs to run
# `url` here must be a column of all_inaugs_tidy holding one address URL per row
all_inaugs_text <- all_inaugs_tidy %>% 
  mutate(address_text = map(url, get_inaugurations))

all_inaugs_text$address_text[[1]]
## [1] " Fellow-Citizens of the Senate and of the House of Representatives: "
## [2] "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years--a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated. 
"                                                                                                                                                                                                                                                                                                                                                                                           
## [3] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow- citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence. 
"                                                                                                                                                                                                                                                                                                                                              
## [4] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people. "
## [5] "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted. 
"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [6] "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                
## [7] "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "

Add Missing Addresses

all_inaugs_text$President[is.na(all_inaugs_text$address_text)]
## [1] "Martin Van Buren"  "James Buchanan"    "James A. Garfield"
## [4] "Calvin Coolidge"
# 7 addresses are missing at this point: Obama's (2) and Trump's, plus Coolidge's, Garfield's, Buchanan's, and Van Buren's, which errored in the scraping.
obama09 <- get_inaugurations("http://avalon.law.yale.edu/21st_century/obama.asp")
obama13 <- readLines("speeches/obama2013.txt")
trump17 <- readLines("speeches/trumpinaug.txt")
vanburen1837 <- readLines("speeches/vanburen1837.txt") # row 13
buchanan1857 <- readLines("speeches/buchanan1857.txt") # row 18
garfield1881 <- readLines("speeches/garfield1881.txt") # row 24
coolidge1925 <- readLines("speeches/coolidge1925.txt") # row 35
all_inaugs_text$address_text[c(13,18,24,35)] <- list(vanburen1837,buchanan1857, garfield1881, coolidge1925)

# let's combine them all now
recents <- data.frame(President = c(rep("Barack Obama", 2), 
                                    "Donald Trump"),
                      year = c(2009, 2013, 2017), 
                      url = NA,
                      address_text = NA)

all_inaugs_text <- rbind(all_inaugs_text, recents)
all_inaugs_text$address_text[c(56:58)] <- list(obama09, obama13, trump17)

Check-in: What did we do?

  1. We found some interesting data to scrape from the web using rvest.
  2. We used tidyr to create tidy data: A data frame of President and year. One observation per row!
  3. We used the consistent HTML structure of the URLs we wanted to scrape to automate collection of web data.
    • Way faster than copy-paste!
    • Though we had to do some by hand, we took advantage of the tidy data and added the missing data manually without much pain.
  4. We now have a tidy data set of Presidential inaugural addresses for text analysis!

A (Small) Text Analysis

Now, I use the tidytext package to get the words out of each inaugural address.

# install.packages("tidytext")
library(tidytext)
all_inaugs_text %>% 
  select(-url) %>% 
  unnest() %>% 
  unnest_tokens(word, address_text) -> presidential_words
head(presidential_words)
##             President year     word
## 1   George Washington 1789   fellow
## 1.1 George Washington 1789 citizens
## 1.2 George Washington 1789       of
## 1.3 George Washington 1789      the
## 1.4 George Washington 1789   senate
## 1.5 George Washington 1789      and

Longest speeches

presidential_words %>% 
  group_by(President,year) %>% 
  summarize(num_words = n()) %>%
  arrange(desc(num_words)) -> presidential_wordtotals
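
As a quick sanity check of the counting step, here is a minimal sketch on a made-up two-row data frame. The `toy` object and its text are hypothetical stand-ins, not the scraped addresses, and the sketch assumes a dplyr version whose `count()` accepts `name` and `sort`. `count()` is shorthand for the `group_by()`, `summarize(n())`, `arrange()` pipeline above.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for all_inaugs_text: one row per address
toy <- data.frame(
  President = c("President A", "President B"),
  year = c(1789, 1793),
  address_text = c("Fellow citizens of the Senate", "I am again called"),
  stringsAsFactors = FALSE
)

toy_totals <- toy %>%
  unnest_tokens(word, address_text) %>%                    # one row per (lowercased) word
  count(President, year, name = "num_words", sort = TRUE)  # words per address, longest first

toy_totals
```

Running `head(presidential_wordtotals)` on the real data should show the longest addresses in the same layout.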

Web APIs

  • Server-side Web APIs (Application Programming Interfaces) are a popular way to provide easy access to data and other services.
  • If you (the client) want data from a server, you typically need one HTTP verb – GET.
library(httr)
sam <- GET("https://api.github.com/users/sctyner")
content(sam)[c("name", "company")]
## $name
## [1] "Sam Tyner"
## 
## $company
## [1] "Iowa State University"
  • Other HTTP verbs (POST, PUT, DELETE, etc.) exist too, mostly for sending data to the server or changing data on it.

Request/response model

  • When you (the client) request a resource from the server, the server responds with the resource plus a bunch of additional information (the response headers).
sam$headers[1:3]
## $server
## [1] "GitHub.com"
## 
## $date
## [1] "Thu, 15 Jun 2017 03:48:17 GMT"
## 
## $`content-type`
## [1] "application/json; charset=utf-8"
  • Nowadays the content-type of an API response is usually XML or JSON. (HTML is great for sharing content between people, but it isn’t great for exchanging data between machines.)
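
Before parsing a response, it is worth checking that the request succeeded and that the body really is JSON. The following is a minimal sketch under assumptions: it needs network access to the GitHub API, and `get_github_user()` is a hypothetical helper name. `GET()`, `stop_for_status()`, `http_type()`, and `content()` are httr functions.

```r
library(httr)

# Hypothetical helper: fetch one GitHub user and return the parsed JSON body
get_github_user <- function(login) {
  url <- paste0("https://api.github.com/users/", login)
  resp <- GET(url)

  # Turn HTTP errors (4xx/5xx) into R errors instead of parsing an error page
  stop_for_status(resp)

  # Check the content-type header before assuming the body is JSON
  if (http_type(resp) != "application/json") {
    stop("Expected JSON but got: ", http_type(resp))
  }

  content(resp, as = "parsed")
}

# Not run here because it needs network access:
# sam <- get_github_user("sctyner")
# sam[c("name", "company")]
```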

Non-HTML Data Formats

What is XML?

XML is a markup language that looks very similar to HTML.

<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25"> Wario Bike </vehicle>
    <vehicle speed="40" weight="67"> Piranha Prowler </vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29"> Royal Racer </vehicle>
    <vehicle speed="50" weight="34"> Wild Wing </vehicle>
  </driver>
</mariokart>
  • This example shows that XML can be (and is) used to store inherently tabular data (thanks Jeroen Ooms)
  • What are the observational units here? How many observations in total?
  • 2 units (drivers and vehicles) and 6 total observations (4 vehicles and 2 drivers).
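
The counts in the last bullet can be checked directly. This is a minimal sketch that uses the xml2 package, which is an assumption on my part since the slides themselves use XML2R. The XML is pasted in as a string so the example runs without any download.

```r
library(xml2)

mariokart_xml <- '<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25"> Wario Bike </vehicle>
    <vehicle speed="40" weight="67"> Piranha Prowler </vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29"> Royal Racer </vehicle>
    <vehicle speed="50" weight="34"> Wild Wing </vehicle>
  </driver>
</mariokart>'

doc <- read_xml(mariokart_xml)

# One node set per observational unit
drivers  <- xml_find_all(doc, "//driver")
vehicles <- xml_find_all(doc, "//vehicle")
c(drivers = length(drivers), vehicles = length(vehicles))

# Attributes become columns; the node text becomes the vehicle model
driver_tab <- data.frame(
  name = xml_attr(drivers, "name"),
  occupation = xml_attr(drivers, "occupation"),
  stringsAsFactors = FALSE
)
vehicle_tab <- data.frame(
  model = xml_text(vehicles, trim = TRUE),
  speed = as.integer(xml_attr(vehicles, "speed")),
  weight = as.integer(xml_attr(vehicles, "weight")),
  stringsAsFactors = FALSE
)
```

Note that `vehicle_tab` built this way has no driver column yet, so it cannot be linked back to `driver_tab` without adding a key.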

XML2R

XML2R is a framework to simplify acquisition of tabular/relational XML.

## # A tibble: 6 x 1
##             obs
##          <list>
## 1 <chr [1 x 3]>
## 2 <chr [1 x 3]>
## 3 <chr [1 x 2]>
## 4 <chr [1 x 3]>
## 5 <chr [1 x 3]>
## 6 <chr [1 x 2]>
## 
##          mariokart//driver mariokart//driver//vehicle 
##                          2                          4
  • XML2R coerces XML into a flat list of observations.
  • The list names track the “observational unit”.
  • The list values track the “observational attributes”.
obs # named list of observations
## $`mariokart//driver//vehicle`
##      speed weight XML_value     
## [1,] "55"  "25"   " Wario Bike "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "40"  "67"   " Piranha Prowler "
## 
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value      
## [1,] "54"  "29"   " Royal Racer "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value    
## [1,] "50"  "34"   " Wild Wing "
## 
## $`mariokart//driver`
##      name    occupation
## [1,] "Peach" "Princess"
collapse_obs(obs) # group into table(s) by observational name/unit
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "55"  "25"   " Wario Bike "     
## [2,] "40"  "67"   " Piranha Prowler "
## [3,] "54"  "29"   " Royal Racer "    
## [4,] "50"  "34"   " Wild Wing "
  • What information have I lost?
  • I can’t map vehicles to the drivers!
library(dplyr)
obs <- add_key(obs, parent = "mariokart//driver", recycle = "name")
## A key for the following children will be generated for the mariokart//driver node:
## mariokart//driver//vehicle
collapse_obs(obs)
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value           name    
## [1,] "55"  "25"   " Wario Bike "      "Bowser"
## [2,] "40"  "67"   " Piranha Prowler " "Bowser"
## [3,] "54"  "29"   " Royal Racer "     "Peach" 
## [4,] "50"  "34"   " Wild Wing "       "Peach"

Now (if I want) I can merge the tables into a single table…

tabs <- collapse_obs(obs)
left_join(as.data.frame(tabs[[1]]), as.data.frame(tabs[[2]])) 
## Joining, by = "name"
##     name occupation speed weight         XML_value
## 1 Bowser      Koopa    55     25       Wario Bike 
## 2 Bowser      Koopa    40     67  Piranha Prowler 
## 3  Peach   Princess    54     29      Royal Racer 
## 4  Peach   Princess    50     34        Wild Wing
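
The join above lets dplyr guess the key (hence the "Joining, by" message). Here is a minimal sketch with the key stated explicitly. The two data frames are typed in by hand, an assumption made so the example runs without XML2R; they mirror the tables above with the whitespace already trimmed, and the `model` column name is my own label for `XML_value`.

```r
library(dplyr)

# Hand-typed copies of the two collapsed tables (hypothetical stand-ins for tabs[[1]] and tabs[[2]])
drivers <- data.frame(
  name = c("Bowser", "Peach"),
  occupation = c("Koopa", "Princess"),
  stringsAsFactors = FALSE
)
vehicles <- data.frame(
  name = c("Bowser", "Bowser", "Peach", "Peach"),
  model = c("Wario Bike", "Piranha Prowler", "Royal Racer", "Wild Wing"),
  speed = c(55, 40, 54, 50),
  weight = c(25, 67, 29, 34),
  stringsAsFactors = FALSE
)

# One row per vehicle, with the driver's attributes repeated alongside
mariokart <- left_join(drivers, vehicles, by = "name")
mariokart
```

Naming the key with `by` avoids surprises if the two tables ever share another column name.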

What about JSON?

  • JSON is the most common format for exchanging data on the web.
  • JavaScript Object Notation (JSON) is built from two structures:
    • arrays => [value1, value2]
    • objects => {"key1": value1, "key2": [value2, value3]}
  • jsonlite is the preferred R package for parsing JSON (it’s even used by Shiny!)
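
To see how the two structures map to R, here is a minimal sketch with jsonlite on two tiny hand-written JSON strings (the keys and values are placeholders, as in the bullets above).

```r
library(jsonlite)

# A JSON array of numbers is simplified to an atomic vector
arr <- fromJSON('[1, 2, 3]')

# A JSON object becomes a named list; the nested array becomes a vector inside it
obj <- fromJSON('{"key1": 1, "key2": [2, 3]}')

str(arr)
str(obj)

# And back again: auto_unbox writes length-one vectors as scalars rather than arrays
toJSON(obj, auto_unbox = TRUE)
```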

Back to Mariokart

[
    {
        "driver": "Bowser",
        "occupation": "Koopa",
        "vehicles": [
            {
                "model": "Wario Bike",
                "speed": 55,
                "weight": 25
            },
            {
                "model": "Piranha Prowler",
                "speed": 40,
                "weight": 67
            }
        ]
    },
    {
        "driver": "Peach",
        "occupation": "Princess",
        "vehicles": [
            {
                "model": "Royal Racer",
                "speed": 54,
                "weight": 29
            },
            {
                "model": "Wild Wing",
                "speed": 50,
                "weight": 34
            }
        ]
    }
]
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
mario <- fromJSON("http://bit.ly/mario-json")
str(mario) 
## 'data.frame':    2 obs. of  3 variables:
##  $ driver    : chr  "Bowser" "Peach"
##  $ occupation: chr  "Koopa" "Princess"
##  $ vehicles  :List of 2
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Wario Bike" "Piranha Prowler"
##   .. ..$ speed : int  55 40
##   .. ..$ weight: int  25 67
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Royal Racer" "Wild Wing"
##   .. ..$ speed : int  54 50
##   .. ..$ weight: int  29 34
mario$driver
## [1] "Bowser" "Peach"
mario$vehicles
## [[1]]
##             model speed weight
## 1      Wario Bike    55     25
## 2 Piranha Prowler    40     67
## 
## [[2]]
##         model speed weight
## 1 Royal Racer    54     29
## 2   Wild Wing    50     34

How do we get two tables (with a common id) like the XML example?

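vehicles <- rbind(mario$vehicles[[1]], mario$vehicles[[2]])
# repeat each driver once per vehicle; a bare cbind would recycle the names as Bowser, Peach, Bowser, Peach
vehicles <- cbind(driver = rep(mario$driver, sapply(mario$vehicles, nrow)), vehicles)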
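
The same idea works for any number of drivers. This is a minimal, self-contained sketch that uses an inline copy of the JSON above (an assumption, so that no download is needed). Each driver name is repeated once per vehicle so that every vehicle row keeps its owner.

```r
library(jsonlite)

mario_json <- '[
  {"driver": "Bowser", "occupation": "Koopa",
   "vehicles": [{"model": "Wario Bike", "speed": 55, "weight": 25},
                {"model": "Piranha Prowler", "speed": 40, "weight": 67}]},
  {"driver": "Peach", "occupation": "Princess",
   "vehicles": [{"model": "Royal Racer", "speed": 54, "weight": 29},
                {"model": "Wild Wing", "speed": 50, "weight": 34}]}
]'
mario <- fromJSON(mario_json)

# Table 1: one row per driver
drivers <- mario[, c("driver", "occupation")]

# Table 2: one row per vehicle, keyed by driver
n_vehicles <- vapply(mario$vehicles, nrow, integer(1))  # vehicles per driver
vehicles <- do.call(rbind, mario$vehicles)              # stack all the nested tables
vehicles$driver <- rep(mario$driver, times = n_vehicles)

vehicles
```

The `driver` column plays the same role as the `name` key added by `add_key()` in the XML example.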

Your Turn

  1. Get the JSON data for our R workshop GitHub commit history:
workshop_commits_raw <- fromJSON("https://api.github.com/repos/heike/rwrks/commits")
  2. Find the table of commits contained in this list. Hint: It’s all about the $

  3. Plot the total number of commits (number of rows) by user as a bar chart.