## A web of data

- In 2008, [an estimated](http://yz.mit.edu/papers/webtables-vldb08.pdf) **154 million** of the 14.1 billion HTML tables on the web contained "high quality relational data"!
- It's hard to quantify how much more exists outside of HTML tables, but there is [an estimate](https://cs.uwaterloo.ca/~x4chu/SIGMOD2015_1.pdf) of **at least 30 million lists** with "high quality relational data".
- A growing number of websites/companies [provide programmatic access](http://www.programmableweb.com/category/all/apis?order=field_popularity) to their data/services via web APIs (that data typically comes in XML/JSON format).

## Before scraping, do some googling!

- If the resource is well-known (e.g. Twitter, Fitbit, etc.), there is *probably* an existing R package for it.
- [ropensci](https://ropensci.org/) has a [ton of R packages](https://ropensci.org/packages/) providing easy-to-use interfaces to open data.
- The [Web Technologies and Services CRAN Task View](http://cran.r-project.org/web/views/WebTechnologies.html) is a great overview of the R tools for working with data that lives on the web.

## A web of *messy* data!

- Recall the concept of [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf).
- Data is in a table where
    * 1 row == 1 observation
    * 1 column == 1 variable (observational attribute)
- Parsing web data (HTML/XML/JSON) is easy (for computers).
- Getting it into a tidy form is typically *not easy*.
- Knowing a bit about modern tools & web technologies makes it *much* easier.

## What is webscraping?

- Extracting data from websites:
    + Tables
    + Links to other websites
    + Text

```{r echo=FALSE, out.width='33%', fig.show='hold', fig.align='default'}
knitr::include_graphics(c('./images/gdpss.png', './images/cropsss.png', './images/gass.png'), auto_pdf = FALSE)
```

## Why webscrape?

>- Because copy-paste is awful

```{r echo=FALSE, out.width='50%'}
knitr::include_graphics("./images/copypastesucks.png", auto_pdf = FALSE)
```

>- Because it's fast
>- Because you can automate it

## Don't abuse your power

- Before scraping a website, please read its terms and conditions!
- It's sometimes more efficient/appropriate to [find the API](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/).
- If a website publicly offers an API, USE IT (instead of scraping)! (more on this later)

Two cautionary tales:

http://www.wired.com/2014/01/how-to-hack-okcupid

http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release
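One quick courtesy check before scraping: most sites publish a `robots.txt` file listing the paths they ask automated clients to stay away from. A minimal sketch with base R (IMDB is just an example domain here):

```{r robotscheck, eval = FALSE}
# robots.txt lists the paths a site's owners ask bots/scrapers not to visit
robots <- readLines("http://www.imdb.com/robots.txt")
head(robots, 15)
```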
# Webscraping with `rvest`: Step-by-Step Start Guide

## Step 1: Find a URL

What data do you want?

>- Information on Oscar-~~nominated~~ winning film Moonlight

Find it on the web!

>- [IMDB page](http://www.imdb.com/title/tt4975722/)

```{r url}
# character variable containing the url you want to scrape
myurl <- "http://www.imdb.com/title/tt4975722/"
```

## Step 2: Read HTML into `R`

> "Huh? What am I doing?" - some of you right now

>- HTML is HyperText Markup Language. All webpages are written in it.
>- Go to any [website](http://www.imdb.com/title/tt4975722/), right click, and choose "View Page Source" to see the HTML.

```{r gethtml, message = FALSE}
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
```
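As an aside, `read_html()` also accepts a string of raw HTML, which is handy for experimenting on a tiny page you can see in full before tackling a real one. A minimal sketch:

```{r inlinehtml, eval = FALSE}
# a tiny handwritten page -- easier to reason about than a full IMDB page
tiny <- read_html("<html><body><h1>A title</h1><p>Some <a href='#'>linked</a> text.</p></body></html>")
tiny %>% html_nodes("p") %>% html_text()
#> [1] "Some linked text."
```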
## Step 3: Figure out where your data is {.smaller}

Need to find your data within the `myhtml` object. Tags to look for:

- `<p>`: paragraphs
- `<h1>`, `<h2>`, etc.: headers
- `<a>`: links
- `<li>`: items in a list
- `<table>`: tables

Use [Selector Gadget](http://selectorgadget.com/) to find the exact location. (Demo)

For more on HTML, I recommend [W3schools' tutorial](http://www.w3schools.com/html/html_intro.asp)

>- You don't need to be an expert in HTML to webscrape with `rvest`!
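To make this concrete, here is a small sketch of the three selector patterns you will pass to `html_nodes()` in the next step (`some_class` and `some_id` are placeholder names for illustration):

```{r selectorpatterns, eval = FALSE}
# placeholder selectors -- swap in what Selector Gadget gives you
myhtml %>% html_nodes("p")            # by tag: every <p> element
myhtml %>% html_nodes(".some_class")  # by class: every element with class="some_class"
myhtml %>% html_nodes("#some_id")     # by id: the element with id="some_id"
```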
## Step 4: Tell `rvest` where to find your data

Copy-paste the selector from Selector Gadget, or pass HTML tags, to `html_nodes()` to extract your data of interest.

```{r getdesc}
myhtml %>% 
  html_nodes(".summary_text") %>% 
  html_text()
```

```{r gettable}
myhtml %>% 
  html_nodes("table") %>% 
  html_table(header = TRUE)
```

## Step 5: Save & tidy data {.smaller}

```{r savetidy, message = FALSE, warning = FALSE}
library(stringr)
library(magrittr)
mydat <- myhtml %>% 
  html_nodes("table") %>% 
  extract2(1) %>% 
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)]
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>% 
  mutate(Actor = Actor,
         Role = str_replace_all(Role, "\n ", ""))
mydat
```

## Your Turn #1

Using `rvest`, scrape a table from Wikipedia. You can pick your own table, or use one of the tables in the country [GDP per capita](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita) example from earlier. Your result should be a data frame with one observation per row and one variable per column.

## Your Turn #1 Solution

```{r yourturn1}
library(rvest)
library(magrittr)
myurl <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
myhtml <- read_html(myurl)
myhtml %>% 
  html_nodes("table") %>% 
  extract2(2) %>% 
  html_table(header = TRUE) %>% 
  mutate(`Int$` = parse_number(`Int$`)) %>% 
  head
```

# Deeper dive into `rvest`

## Key Functions: `html_nodes`

- `html_nodes(x, "path")` extracts all elements from the page `x` that have the tag / class / id `path`. (Use Selector Gadget to determine `path`.)
- `html_node()` does the same thing but only returns the first matching element.
- Calls can be chained:

```{r nodesex}
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a")     # then get all the links in those paragraphs
```

## Key Functions: `html_text`

- `html_text(x)` extracts all text from the nodeset `x`.
- Good for cleaning output:

```{r textex}
myhtml %>% 
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text()         # get the linked text only
```

## Key Functions: `html_table` {.smaller}

- `html_table(x, header, fill)` parses HTML table(s) from `x` into a data frame or list of data frames.
- The structure of HTML makes finding and extracting tables easy!

```{r tableex}
myhtml %>% 
  html_nodes("table") %>% # get the tables
  head(2)                 # look at the first 2
```

```{r tableex2}
myhtml %>% 
  html_nodes("table") %>%   # get the tables
  extract2(2) %>%           # pick the second one to parse
  html_table(header = TRUE) # parse the table
```

## Key Functions: `html_attrs`

- `html_attrs(x)` extracts all attribute elements from a nodeset `x`.
- `html_attr(x, name)` extracts the `name` attribute from all elements in nodeset `x`.
- Attributes are things in the HTML like `href`, `title`, `class`, `style`, etc.
- Use these functions to find and extract your data.

```{r attrsex}
myhtml %>% 
  html_nodes("table") %>% 
  extract2(2) %>% 
  html_attrs()
```

```{r attrsex2}
myhtml %>% 
  html_nodes("p") %>% 
  html_nodes("a") %>% 
  html_attr("href")
```
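The extractor functions pair naturally: for example, `html_text()` and `html_attr()` on the same nodeset give a tidy table of link text and link targets. A minimal sketch:

```{r linktable, eval = FALSE}
# one row per link found inside a paragraph: its text and where it points
links <- myhtml %>% 
  html_nodes("p") %>% 
  html_nodes("a")
tibble(
  text = links %>% html_text(),
  href = links %>% html_attr("href")
)
```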
## Other functions

- `html_children` lists the "children" of the HTML page. Can be chained like `html_nodes`.
- `html_name` gives the tags of a nodeset. Use it in a chain with `html_children`:

```{r childex}
myhtml %>% 
  html_children() %>% 
  html_name()
```

- `html_form` parses HTML forms (checkboxes, fill-in-the-blanks, etc.).
- `html_session` simulates a session in an HTML browser; use the functions `jump_to` and `back` to navigate through the pages.

## Your Turn #2

Find another website you want to scrape (ideas: [all bills in the house so far this year](https://www.congress.gov/search), [video game reviews](http://www.metacritic.com/game), anything Wikipedia) and use *at least* 3 different `rvest` functions in a chain to extract some data.

# Advanced Example: Inaugural Addresses

## The Data

- [The Avalon Project](http://avalon.law.yale.edu/subject_menus/inaug.asp) has most of the U.S. Presidential inaugural addresses.
- Obama's & Trump's ('13, '17), Van Buren 1837, Buchanan 1857, Garfield 1881, and Coolidge 1925 are missing, but are easily found elsewhere. I have them saved as text files on the website.
- Let's scrape all of them from The Avalon Project!

## Get data frame of addresses

- We could get the Presidents' names and inauguration years from another source, but we'll use The Avalon Project's site because it's a good example of data that needs tidying.

```{r getyears}
url <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
# even though it's called "all inaugs", some are missing
all_inaugs <- (url %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE, header = TRUE)) %>% 
  extract2(3)
# tidy table of addresses
all_inaugs_tidy <- all_inaugs %>% 
  gather(term, year, -President) %>% 
  filter(!is.na(year)) %>% 
  select(-term) %>% 
  arrange(year)
head(all_inaugs_tidy)
```

## Get links to visit & scrape

```{r getlinks}
# get the links to the addresses
inaugadds_adds <- (url %>% 
  read_html() %>% 
  html_nodes("a") %>% 
  html_attr("href"))[12:66]
# create the urls to scrape
urlstump <- "http://avalon.law.yale.edu/"
inaugurls <- paste0(urlstump, str_replace(inaugadds_adds, "../", ""))
all_inaugs_tidy$url <- inaugurls
head(all_inaugs_tidy)
```

## Automate scraping

- A function to read the addresses and get the text of the speeches, with a catch for read errors:

```{r functiongetspeech, cache=TRUE, message = FALSE, warning = FALSE}
get_inaugurations <- function(url){
  test <- try(url %>% read_html(), silent = TRUE)
  if ("try-error" %in% class(test)) {
    return(NA)
  } else {
    address <- url %>% 
      read_html() %>% 
      html_nodes("p") %>% 
      html_text()
    return(unlist(address))
  }
}
# takes about 30 secs to run
all_inaugs_text <- all_inaugs_tidy %>% 
  mutate(address_text = map(url, get_inaugurations))
all_inaugs_text$address_text[[1]]
```
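As an aside, the same "keep going on failure" idea can be written with `purrr::possibly()`, which wraps a function so it returns a fallback value instead of throwing an error. A sketch of that alternative:

```{r possiblysketch, eval = FALSE}
# same scraping step as above, expressed with purrr instead of try()
get_text <- function(url) {
  url %>% read_html() %>% html_nodes("p") %>% html_text()
}
safe_get_text <- possibly(get_text, otherwise = NA)  # NA instead of an error
all_inaugs_text_alt <- all_inaugs_tidy %>% 
  mutate(address_text = map(url, safe_get_text))
```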
## Add Missings

```{r missings}
all_inaugs_text$President[is.na(all_inaugs_text$address_text)]
# there are 7 missing at this point: Obama's and Trump's, plus Coolidge,
# Garfield, Buchanan, and Van Buren, which errored in the scraping
obama09 <- get_inaugurations("http://avalon.law.yale.edu/21st_century/obama.asp")
obama13 <- readLines("speeches/obama2013.txt")
trump17 <- readLines("speeches/trumpinaug.txt")
vanburen1837 <- readLines("speeches/vanburen1837.txt") # row 13
buchanan1857 <- readLines("speeches/buchanan1857.txt") # row 18
garfield1881 <- readLines("speeches/garfield1881.txt") # row 24
coolidge1925 <- readLines("speeches/coolidge1925.txt") # row 35
all_inaugs_text$address_text[c(13, 18, 24, 35)] <- list(vanburen1837, buchanan1857, garfield1881, coolidge1925)
# let's combine them all now
recents <- data.frame(President = c(rep("Barack Obama", 2), "Donald Trump"),
                      year = c(2009, 2013, 2017),
                      url = NA, address_text = NA)
all_inaugs_text <- rbind(all_inaugs_text, recents)
all_inaugs_text$address_text[c(56:58)] <- list(obama09, obama13, trump17)
```

## Check-in: What did we do? {.smaller}

1. We found some interesting data to scrape from the web using `rvest`.
2. We used `tidyr` to create tidy data: a data frame of President and year. One observation per row!
3. We used the consistent HTML structure of the URLs we wanted to scrape to automate collection of web data.
    + Way faster than copy-paste!
    + Though we had to do some by hand, we took advantage of the tidy data and added the missing addresses manually without much pain.
4. We now have a tidy data set of Presidential inaugural addresses for text analysis!

## A (Small) Text Analysis

Now, I use the [`tidytext`](http://tidytextmining.com/) package to get the words out of each inaugural address.

```{r textanalysis}
# install.packages("tidytext")
library(tidytext)
presidential_words <- all_inaugs_text %>% 
  select(-url) %>% 
  unnest() %>% 
  unnest_tokens(word, address_text)
head(presidential_words)
```

## Longest speeches

```{r longestspeech}
presidential_wordtotals <- presidential_words %>% 
  group_by(President, year) %>% 
  summarize(num_words = n()) %>% 
  arrange(desc(num_words))
```

```{r speechplot, echo = FALSE, fig.align='center', cache = TRUE, fig.height=5, fig.width=9.5}
library(ggrepel)
ggplot(presidential_wordtotals) + 
  geom_bar(aes(x = reorder(year, num_words), y = num_words), 
           stat = "identity", fill = 'white', color = 'black') + 
  geom_label_repel(aes(x = reorder(year, num_words), y = num_words, label = President), size = 2.5) + 
  labs(y = "Word count of speech", x = "Year (sorted by word count)", 
       title = "Length of Presidential Inaugural Addresses", 
       subtitle = "Max: 8,459; Min: 135; Median: 2,090; Mean: 2,341") + 
  theme(axis.text.x = element_text(angle = 45, size = 7), 
        plot.subtitle = element_text(hjust = .5))
```

# Web APIs

## Web APIs {.build}

- [Server-side web APIs](https://en.wikipedia.org/wiki/Web_API#Server-side) (Application Programming Interfaces) are a popular way to provide easy access to data and other services.
- If you (the client) want data from a server, you typically need just one HTTP verb: `GET`.

```{r}
library(httr)
sam <- GET("https://api.github.com/users/sctyner")
content(sam)[c("name", "company")]
```

- There are other HTTP verbs: `POST`, `PUT`, `DELETE`, etc.

## Request/response model {.build}

- When you (the client) _request_ a resource from the server, the server _responds_ with a bunch of additional information.

```{r}
sam$headers[1:3]
```

- Nowadays the content-type is usually XML or JSON. (HTML is great for _sharing content_ between _people_, but it isn't great for _exchanging data_ between _machines_.)
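To see the pieces of the response individually, here is a small sketch that inspects the same GitHub response with a few `httr` helpers:

```{r httrsketch, eval = FALSE}
resp <- GET("https://api.github.com/users/sctyner")
status_code(resp)                # 200 means the request succeeded
headers(resp)[["content-type"]]  # e.g. "application/json; charset=utf-8"
str(content(resp, as = "parsed"), max.level = 1)  # httr parses the JSON into an R list
```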
# Non-HTML Data Formats

## What is XML?

XML is a markup language that looks very similar to HTML.

```xml
<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25">Wario Bike</vehicle>
    <vehicle speed="40" weight="67">Piranha Prowler</vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29">Royal Racer</vehicle>
    <vehicle speed="50" weight="34">Wild Wing</vehicle>
  </driver>
</mariokart>
```

- This example shows that XML can be (and is) used to store inherently tabular data ([thanks Jeroen Ooms](http://arxiv.org/pdf/1403.2805v1.pdf)).
- What are the observational units here? How many observations in total?
- 2 units and 6 total observations (4 vehicles and 2 drivers).

## XML2R {.build}

[XML2R](https://github.com/cpsievert/XML2R) is a framework to simplify acquisition of tabular/relational XML.

```{r, echo=FALSE, message = FALSE, warning=FALSE}
# hopefully no one is watching
# devtools::install_github("cpsievert/XML2R")
library(XML2R)
#obs <- read_xml("https://gist.githubusercontent.com/cpsievert/85e340814cb855a60dc4/raw/651b7626e34751c7485cff2d7ea3ea66413609b8/mariokart.xml")
obs <- XML2Obs("https://gist.githubusercontent.com/cpsievert/85e340814cb855a60dc4/raw/651b7626e34751c7485cff2d7ea3ea66413609b8/mariokart.xml", quiet = TRUE)
data_frame(obs)
table(names(obs))
```

* **XML2R** coerces XML into a *flat* list of observations.
* The list names track the "observational unit".
* The list values track the "observational attributes".

---

```{r}
obs # named list of observations
```

---

```{r}
collapse_obs(obs) # group into table(s) by observational name/unit
```

- What information have I lost?
- I can't map vehicles to the drivers!

---

```{r}
library(dplyr)
obs <- add_key(obs, parent = "mariokart//driver", recycle = "name")
collapse_obs(obs)
```

---

Now (if I want) I can merge the tables into a single table...

```{r}
tabs <- collapse_obs(obs)
left_join(as.data.frame(tabs[[1]]), as.data.frame(tabs[[2]]))
```

## What about JSON? {.build}

- JSON is _the_ format for data on the web.
- JavaScript Object Notation (JSON) is built from two components:
    * arrays => `[value1, value2]`
    * objects => `{"key1": value1, "key2": [value2, value3]}`
- [jsonlite](http://cran.r-project.org/web/packages/jsonlite/index.html) is the preferred R package for parsing JSON (it's even used by Shiny!).

## Back to Mariokart {.smaller}

```json
[
  {
    "driver": "Bowser",
    "occupation": "Koopa",
    "vehicles": [
      { "model": "Wario Bike", "speed": 55, "weight": 25 },
      { "model": "Piranha Prowler", "speed": 40, "weight": 67 }
    ]
  },
  {
    "driver": "Peach",
    "occupation": "Princess",
    "vehicles": [
      { "model": "Royal Racer", "speed": 54, "weight": 29 },
      { "model": "Wild Wing", "speed": 50, "weight": 34 }
    ]
  }
]
```

---

```{r}
library(jsonlite)
mario <- fromJSON("http://bit.ly/mario-json")
str(mario)
```

---

```{r}
mario$driver
mario$vehicles
```

How do we get two tables (with a common id) like the XML example?

```{r}
vehicles <- rbind(mario$vehicles[[1]], mario$vehicles[[2]])
# repeat each driver once per vehicle so the rows line up correctly
vehicles <- cbind(driver = rep(mario$driver, times = sapply(mario$vehicles, nrow)), vehicles)
```

## Your Turn {data-background=#527a7a}

1. Get the JSON data for our R workshop GitHub commit history:

```{r}
workshop_commits_raw <- fromJSON("https://api.github.com/repos/heike/rwrks/commits")
```

2. Find the table of commits contained in this list. Hint: it's all about the `$`.
3. Plot the total number of commits (number of rows) by user as a bar chart.
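A sketch of one possible approach to steps 2 and 3 (the exact column names depend on what the GitHub API returns, so treat this as a starting point rather than the answer):

```{r committotals, eval = FALSE}
# jsonlite simplifies the JSON into nested data frames;
# the commit metadata is assumed to live in the `commit` column
commit_authors <- workshop_commits_raw$commit$author$name
ggplot(data.frame(author = commit_authors), aes(x = author)) +
  geom_bar() +
  coord_flip() +
  labs(x = NULL, y = "Number of commits")
```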