Reading Files

class: center, middle, inverse, title-slide

# Reading Files
### Haley Jeppson, Sam Tyner

---

## Data in Excel

- Formats xls and csv - what's the difference?

- File extensions xls and xlsx are proprietary Excel formats, binary files

- csv is an extension for Comma Separated Value files. They are text files - directly readable.

- Example: Gas prices in midwest since 1994

---

## Gas Prices in the Midwest

```r
library(readr)
midwest <- read_csv("http://heike.github.io/rwrks/03-r-format/data/midwest.csv")
head(midwest)
```

```
## # A tibble: 6 x 11
##   `Year-Month` `Week 1` X3    `Week 2` X5    `Week 3` X7    `Week 4` X9   
##   <chr>        <chr>    <chr> <chr>    <chr> <chr>    <chr> <chr>    <chr>
## 1 <NA>         End Date Value End Date Value End Date Value End Date Value
## 2 1994-Nov     <NA>     <NA>  <NA>     <NA>  <NA>     <NA>  28-Nov   1.122
## 3 1994-Dec     5-Dec    1.086 12-Dec   1.057 19-Dec   1.039 26-Dec   1.027
## 4 1995-Jan     2-Jan    1.025 9-Jan    1.046 16-Jan   1.031 23-Jan   1.054
## 5 1995-Feb     6-Feb    1.045 13-Feb   1.04  20-Feb   1.031 27-Feb   1.052
## 6 1995-Mar     6-Mar    1.053 13-Mar   1.042 20-Mar   1.048 27-Mar   1.065
## # ... with 2 more variables: `Week 5` <chr>, X11 <chr>
```

---
## `read_csv` vs. `read_*`

`read_csv` is just one way to read a file using the `readr` package:

- `read_delim`: the most generic function. Use the `delim` argument to read a file with any type of delimiter 
- `read_tsv`: read tab separated files 
- `read_lines`: read a file into a vector that has one element per line of the file
- `read_file`: read a file into a single character element
- `read_table`: read a file separated by space

---
class: inverse
## Your Turn

Have a look at the parameters of `read_csv` to solve the following problems:

1. Read the first two lines of the file into an object called `midwest_names`

2. Read everything EXCEPT the first two lines into an object called `midwest_data`

Hint: Visit http://heike.github.io/rwrks/03-r-format/data/midwest.csv to download the data and look at it in Excel to help figure out what the issue is.

---

## Quick Lesson in Data Cleaning

```r
library(lubridate)
library(tidyverse)
values <- c(midwest_data$X3, midwest_data$X5, midwest_data$X7, 
            midwest_data$X9, midwest_data$X11)
dates <- c(paste(midwest_data$X1, midwest_data$X2, sep = "-"), 
           paste(midwest_data$X1, midwest_data$X4, sep = "-"),
           paste(midwest_data$X1, midwest_data$X6, sep = "-"),
           paste(midwest_data$X1, midwest_data$X8, sep = "-"),
           paste(midwest_data$X1, midwest_data$X10, sep = "-"))

dates <- dates[!is.na(values)]
values <- values[!is.na(values)]

dates <- ymd(dates)

midwest_gas <- data_frame(date = dates, price = values)
midwest_gas <- arrange(midwest_gas, dates)
```

---

## Now we can make a plot!

```r
library(ggplot2)
ggplot(midwest_gas, aes(x = date, y = price)) + geom_line()
```

![](1-Files_files/figure-html/unnamed-chunk-4-1.png)

---

## Reading Excel Data

Download the midwest.xls file to your current working directory (`getwd()`)

```r
library(readxl)

midwest2 <- read_excel("midwest.xls")

head(midwest2)
```

Something isn't quite right here...

```
## # A tibble: 6 x 11
##   `Year-Month` `Week 1` X__1  `Week 2` X__2  `Week 3` X__3  `Week 4` X__4 
##   <chr>        <chr>    <chr> <chr>    <chr> <chr>    <chr> <chr>    <chr>
## 1 <NA>         End Date Value End Date Value End Date Value End Date Value
## 2 1994-Nov     <NA>     <NA>  <NA>     <NA>  <NA>     <NA>  39779    1.122
## 3 1994-Dec     39786    1.086 39793    1.057 39800    1.039 39807    1.027
## 4 1995-Jan     39448    1.025 39455    1.046 39462    1.031 39469    1.054
## 5 1995-Feb     39483    1.045 39490    1.04  39497    1.031 39504    1.052
## 6 1995-Mar     39512    1.053 39519    1.042 39526    1.048 39533    1.065
## # ... with 2 more variables: `Week 5` <chr>, X__5 <chr>
```

---

## Reading Excel Data: Attempt #2

The 2nd row of the excel file also contains variable names.

Let's skip the first this time. We can then rename the first column.

```r
library(readxl)

midwest2 <- read_excel("midwest.xls", skip = 1)
names(midwest2)[1] <- "Year-Month"
head(midwest2)
```

```
## # A tibble: 6 x 11
##   `Year-Month` `End Date`          Value `End Date__1`       Value__1
##   <chr>        <dttm>              <dbl> <dttm>                 <dbl>
## 1 1994-Nov     NA                  NA    NA                     NA   
## 2 1994-Dec     2012-12-05 00:00:00  1.09 2012-12-12 00:00:00     1.06
## 3 1995-Jan     2012-01-02 00:00:00  1.02 2012-01-09 00:00:00     1.05
## 4 1995-Feb     2012-02-06 00:00:00  1.04 2012-02-13 00:00:00     1.04
## 5 1995-Mar     2012-03-06 00:00:00  1.05 2012-03-13 00:00:00     1.04
## 6 1995-Apr     2012-04-03 00:00:00  1.09 2012-04-10 00:00:00     1.11
## # ... with 6 more variables: `End Date__2` <dttm>, Value__2 <dbl>, `End
## #   Date__3` <dttm>, Value__3 <dbl>, `End Date__4` <dttm>, Value__4 <dbl>
```

Better, but not yet perfect...

---

## `foreign` Package

- Other file formats can be read using the functions from package `foreign`

- SPSS: `read.spss`

- SAS: `read.xport`

- Minitab: `read.mtp`

- Systat: `read.systat`

---
class: inverse
## Your Turn (~2 minutes)

The NHANES (National Health and Nutrition Survey) publishes data in the SAS xport format:

https://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2013

1. Scroll to the bottom, choose one of the datasets (Demographics, Examination, etc.). Download the Data file (XPT)

2. Use `read.xport()` to load the file into R

3. Briefly examine the dataset you've imported (use `head` or `tail`, etc)