## A web of data
- In 2008, [an estimated](http://yz.mit.edu/papers/webtables-vldb08.pdf) **154 million HTML tables** (out of 14.1 billion) contained 'high quality relational data'!
- It is hard to quantify how much more exists outside of HTML tables, but there is [an estimate](https://cs.uwaterloo.ca/~x4chu/SIGMOD2015_1.pdf) of **at least 30 million lists** with 'high quality relational data'.
- A growing number of websites/companies [provide programmatic access](http://www.programmableweb.com/category/all/apis?order=field_popularity) to their data/services via web APIs (that data typically comes in XML/JSON format).
## Before scraping, do some googling!
- If a resource is well known (e.g. Twitter, Fitbit), there is *probably* an existing R package for it.
- [ropensci](https://ropensci.org/) has a [ton of R packages](https://ropensci.org/packages/) providing easy-to-use interfaces to open data.
- The [Web Technologies and Services CRAN Task View](http://cran.r-project.org/web/views/WebTechnologies.html) is a great overview of tools for working in R with data that lives on the web.
## A web of *messy* data!
- Recall the concept of [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf).
- Data is in a table where
    * 1 row == 1 observation
    * 1 column == 1 variable (observational attribute)
- Parsing web data (HTML/XML/JSON) is easy (for computers)
- Getting it in a tidy form is typically *not easy*.
- Knowing a bit about modern tools & web technologies makes it *much* easier.
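
As a minimal sketch of what "getting it in a tidy form" can involve, the chunk below reshapes a small made-up table that has one column per year. The data frame `messy` and its values are invented for illustration, and `pivot_longer()` is just one possible tool (this assumes tidyr 1.0.0 or later is installed).

```{r message = FALSE}
library(tidyr)

# Hypothetical "messy" table: one column per year (all values are made up)
messy <- data.frame(
  country = c("A", "B"),
  `2015` = c(1.0, 2.0),
  `2016` = c(1.5, 2.5),
  check.names = FALSE
)

# Reshape so that 1 row == 1 observation and 1 column == 1 variable
tidy <- pivot_longer(messy, cols = -country,
                     names_to = "year", values_to = "value")
tidy
```

After reshaping, each row holds a single country-year observation, and `year` is an ordinary variable rather than a set of column names.
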
## What is webscraping?
- Extract data from websites
    + Tables
    + Links to other websites
    + Text
```{r echo=FALSE, out.width='33%', fig.show='hold', fig.align='default'}
knitr::include_graphics(c('./images/gdpss.png','./images/cropsss.png','./images/gass.png'), auto_pdf = FALSE)
```
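
The three kinds of content listed above can be sketched on a tiny made-up page, so the chunk runs without a network connection. The HTML string and its contents are invented for illustration, and the calls assume rvest 1.0.0 or later (older versions use `html_node()`/`html_nodes()` in place of `html_element()`/`html_elements()`).

```{r message = FALSE}
library(rvest)

# A tiny made-up page, parsed from a literal string instead of a URL
page <- read_html('
  <html><body>
    <p>Some text about crops.</p>
    <a href="https://example.com/more">More data</a>
    <table>
      <tr><th>crop</th><th>yield</th></tr>
      <tr><td>corn</td><td>10</td></tr>
      <tr><td>wheat</td><td>7</td></tr>
    </table>
  </body></html>
')

# Tables: html_table() converts a <table> node into a data frame
tab <- page %>% html_element("table") %>% html_table()
tab

# Links: the target of each <a> tag is stored in its href attribute
links <- page %>% html_elements("a") %>% html_attr("href")
links

# Text: the content of each <p> tag
txt <- page %>% html_elements("p") %>% html_text()
txt
```
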
## Why webscrape?
>- Because copy-paste is awful
```{r echo=FALSE, out.width='50%'}
knitr::include_graphics("./images/copypastesucks.png", auto_pdf = FALSE)
```
>- Because it's fast
>- Because you can automate it
## Don't abuse your power
- Before scraping a website, please read the terms and conditions!!
- It's sometimes more efficient/appropriate to [find the API](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/).
- If a website publicly offers an API, USE IT (instead of scraping)! (More on this later.)
- <http://www.wired.com/2014/01/how-to-hack-okcupid>
- <http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release>
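
One practical reason to prefer an API is that its data usually arrives already structured. As a rough sketch, the chunk below parses a made-up JSON string standing in for the body of a hypothetical API response. The field names and values are invented, and `jsonlite` is only one of several packages that can do this.

```{r}
library(jsonlite)

# Made-up JSON, standing in for the body of a hypothetical API response
response_body <- '[
  {"title": "Film A", "year": 2016, "rating": 7.4},
  {"title": "Film B", "year": 2017, "rating": 8.1}
]'

# fromJSON() simplifies an array of records into a data frame
films <- fromJSON(response_body)
films
```

No HTML has to be picked apart here: each JSON record becomes one row and each field one column.
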
# Webscraping with `rvest`: Step-by-Step Start Guide
## Step 1: Find a URL
What data do you want?
>- Information on Oscar-~~nominated~~ winning film Moonlight
Find it on the web!
>- [IMDB page](http://www.imdb.com/title/tt4975722/)
```{r url}
# character variable containing the url you want to scrape
myurl <- "http://www.imdb.com/title/tt4975722/"
```
## Step 2: Read HTML into `R`
> "Huh? What am I doing?" - some of you right now
>- HTML is HyperText Markup Language. All webpages are written with it.
>- Go to any [website](http://www.imdb.com/title/tt4975722/), right click, click "View Page Source" to see the HTML
```{r gethtml, message = FALSE}
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
```
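
To get a feel for what a parsed page looks like, here is a minimal sketch on a small made-up document rather than the real IMDB page. The HTML string is invented for illustration, `xml_structure()` comes from the xml2 package (which rvest builds on), and `html_elements()` assumes rvest 1.0.0 or later.

```{r message = FALSE}
library(rvest)

# A small made-up document, standing in for a real page source
toy_html <- read_html('
  <html>
    <head><title>Toy page</title></head>
    <body>
      <h1>A heading</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </html>
')

# The parsed document is a tree of nested tags
xml2::xml_structure(toy_html)

# Count the paragraph tags in that tree
n_paragraphs <- length(html_elements(toy_html, "p"))
n_paragraphs
```
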
## Step 3: Figure out where your data is {.smaller}
Need to find your data within the `myhtml` object.
Tags to look for:
- `<p>`: paragraphs
- `