class: center, middle, inverse, title-slide # Motivating Example ### Haley Jeppson and Sam Tyner --- ## Motivating Example - Kick off the workshop by exploring a real data set using R! -- - Goal: get the flavor of using R for data management and exploration -- - Don't worry about understanding all the coding right away - We will go back and explain how it all works in detail - Follow along using [2-MotivatingExample.R](../code/2-MotivatingExample.R) --- ## Getting Started Let's begin by installing and loading `tidyverse`: ```r install.packages("tidyverse") library(tidyverse) ``` This workshop will focus largely on a group of packages that live together under the name `tidyverse`. `tidyverse` includes the following well known packages: - `ggplot2` - `dplyr` - `tidyr` - `readr` - and more! --- ## Salmonella Shedding Data - Fecal *Salmonella* shedding recorded over the course of 21 days - Several **variables** were recorded for each pig at certain time points: - Amount of *Salmonella* in feces (log10) - Pig weight (kg) - Dietary treatment group - **Primary Question**: Do the different dietary treatments affect *Salmonella* shedding? --- ## First look at data in R Lets use R to look at the top few rows of the *Salmonella* shedding data set. First, we load tip using `read_csv`: ```r shed <- read_csv("http://heike.github.io/rwrks/01-r-intro/data/daily_shedding.csv") ``` Now, we use the `head` function to look at the first 6 rows of the data: ```r head(shed) ``` ``` ## # A tibble: 6 x 12 ## pignum time_point pig_weight temp pan_wt wet_wt Dry_wt_pan Dry_weight ## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 77 2 NA 103 1.02 1.09 1.31 0.290 ## 2 87 2 NA 104 1.04 1.02 1.33 0.290 ## 3 122 2 NA 104 1.02 0.990 1.11 0.0900 ## 4 160 2 NA 104 1.09 1.00 1.36 0.270 ## 5 191 2 NA 104 1.03 1.02 1.27 0.240 ## 6 224 2 NA 103 1.08 1.00 1.39 0.310 ## # ... with 4 more variables: percent_drymatter <dbl>, ## # daily_shedding <dbl>, treatment <chr>, total_shedding <dbl> ``` --- ## *Salmonella* Shedding Data: Attributes How big is this data set and what types of variables are in each column? ```r str(shed) ``` ``` ## Classes 'tbl_df', 'tbl' and 'data.frame': 295 obs. of 12 variables: ## $ pignum : int 77 87 122 160 191 224 337 345 419 458 ... ## $ time_point : int 2 2 2 2 2 2 2 2 2 2 ... ## $ pig_weight : num NA NA NA NA NA NA NA NA NA NA ... ## $ temp : num 103 104 104 104 104 ... ## $ pan_wt : num 1.02 1.04 1.02 1.09 1.03 1.08 1.03 1.09 1.07 1.04 ... ## $ wet_wt : num 1.09 1.02 0.99 1 1.02 1 1.02 0.99 1.07 1.06 ... ## $ Dry_wt_pan : num 1.31 1.33 1.11 1.36 1.27 1.39 1.3 1.24 1.38 1.26 ... ## $ Dry_weight : num 0.29 0.29 0.09 0.27 0.24 0.31 0.27 0.15 0.31 0.22 ... ## $ percent_drymatter: num 26.61 28.43 9.09 27 23.53 ... ## $ daily_shedding : num 5.3 5.7 13.22 6.75 5.86 ... ## $ treatment : chr "control" "control" "control" "control" ... ## $ total_shedding : num 2.3 2.48 5.74 2.93 2.54 ... ## - attr(*, "spec")=List of 2 ## ..$ cols :List of 12 ## .. ..$ pignum : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ time_point : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ pig_weight : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ temp : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ pan_wt : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ wet_wt : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ Dry_wt_pan : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ Dry_weight : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ percent_drymatter: list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ daily_shedding : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## .. ..$ treatment : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ total_shedding : list() ## .. .. ..- attr(*, "class")= chr "collector_double" "collector" ## ..$ default: list() ## .. ..- attr(*, "class")= chr "collector_guess" "collector" ## ..- attr(*, "class")= chr "col_spec" ``` --- ## *Salmonella* Shedding: Variables Let's get a summary of the values for each variable: ```r summary(shed) ``` ``` ## pignum time_point pig_weight temp ## Min. : 6.0 Min. : 0.0 Min. : 9.48 Min. :101.2 ## 1st Qu.:119.0 1st Qu.: 2.0 1st Qu.:16.02 1st Qu.:102.4 ## Median :224.0 Median : 7.0 Median :19.78 Median :102.9 ## Mean :234.7 Mean : 8.8 Mean :21.01 Mean :102.9 ## 3rd Qu.:361.0 3rd Qu.:14.0 3rd Qu.:25.29 3rd Qu.:103.4 ## Max. :472.0 Max. :21.0 Max. :36.30 Max. :105.2 ## NA's :59 NA's :118 ## pan_wt wet_wt Dry_wt_pan Dry_weight ## Min. :1.000 Min. :0.700 Min. :1.110 Min. :0.090 ## 1st Qu.:1.030 1st Qu.:1.000 1st Qu.:1.300 1st Qu.:0.250 ## Median :1.050 Median :1.020 Median :1.320 Median :0.280 ## Mean :1.054 Mean :1.026 Mean :1.322 Mean :0.268 ## 3rd Qu.:1.070 3rd Qu.:1.050 3rd Qu.:1.353 3rd Qu.:0.290 ## Max. :1.120 Max. :1.110 Max. :1.460 Max. :0.380 ## NA's :122 NA's :123 NA's :123 NA's :123 ## percent_drymatter daily_shedding treatment total_shedding ## Min. : 9.091 Min. : 0.000 Length:295 Min. : 0.000 ## 1st Qu.:24.292 1st Qu.: 0.000 Class :character 1st Qu.: 1.699 ## Median :26.923 Median : 3.912 Mode :character Median :13.904 ## Mean :26.109 Mean : 3.765 Mean :17.423 ## 3rd Qu.:28.431 3rd Qu.: 5.521 3rd Qu.:30.893 ## Max. :36.893 Max. :13.218 Max. :66.207 ## NA's :123 ``` --- ## Scatterplots Lets look at the relationship between time point and *Salmonella* shedding. ```r ggplot(shed, aes(x=time_point, y=total_shedding)) + geom_point() ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-7-1.png)<!-- --> --- ## More Scatterplots Color the points by treatment groups ```r ggplot(shed, aes(x=time_point, y=total_shedding, colour = treatment)) + geom_point() ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-8-1.png)<!-- --> --- ## More Scatterplots Switch to lines for easier reading ```r ggplot(shed, aes(x=time_point, y=total_shedding, colour = treatment)) + geom_line(aes(group=pignum)) ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-9-1.png)<!-- --> --- ## Even More Plots Add facetting ```r ggplot(shed, aes(x=time_point, y=total_shedding, color=treatment)) + geom_line(aes(group=pignum)) + facet_wrap(~treatment) ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- ## Variable Creation and Data Filtering We will make a new variable in the data: `gain = weight at day 21 - weight at day 0`. We will then filter the data to only include the final values. ```r final_shed <- shed %>% group_by(pignum) %>% mutate(gain = pig_weight[time_point == 21] - pig_weight[time_point == 0]) %>% filter(time_point == 21) %>% ungroup() %>% select(-c(4:9)) summary(final_shed$gain) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 8.22 12.40 14.14 14.16 16.04 18.86 ``` --- ## Histogram Lets look distribution of the final shedding values with a histogram ```r ggplot(final_shed) + geom_histogram(aes(x = total_shedding)) ``` ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-12-1.png)<!-- --> --- ## One pig did not shed at all... which pig and what treatment did it receive? ```r final_shed[which.min(final_shed$total_shedding),] ``` ``` ## # A tibble: 1 x 7 ## pignum time_point pig_weight daily_shedding treatment total_shedding ## <int> <int> <dbl> <dbl> <chr> <dbl> ## 1 97 21 24.9 0 RPS 0 ## # ... with 1 more variable: gain <dbl> ``` --- ## Shedding by Treatment Look at the average cumulative shedding value for each treatment seperately ```r # Using base `R`: median(final_shed$total_shedding[final_shed$treatment == "control"]) ``` ``` ## [1] 44.42518 ``` ```r # then repeat for each treatment. Or ... ``` ```r # Using `tidyverse`: final_shed %>% group_by(treatment) %>% summarise(med_shed = median(total_shedding)) ``` ``` ## # A tibble: 6 x 2 ## treatment med_shed ## <chr> <dbl> ## 1 Acid 35.4 ## 2 Bglu 37.6 ## 3 control 44.4 ## 4 RCS 39.4 ## 5 RPS 29.1 ## 6 Zn+Cu 44.1 ``` ] --- ## Statistical Significance There is a difference but is it statistically significant? ```r wilcox.test(total_shedding ~ treatment, data = final_shed, subset = treatment %in% c("control", "RPS")) ``` ``` ## ## Wilcoxon rank sum test ## ## data: total_shedding by treatment ## W = 75, p-value = 0.01327 ## alternative hypothesis: true location shift is not equal to 0 ``` --- ## Boxplots We could compare the cumulative shedding values of each treatment group with a side by side boxplot ```r ggplot(final_shed) + geom_boxplot( aes(treatment, total_shedding, fill = treatment)) ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-17-1.png)<!-- --> --- ## Boxplots Again Alternatively, we could use the original data to compare the daily shedding values of each treatment group with a side by side boxplot over time ```r shed %>% filter(time_point != 0) %>% ggplot() + geom_boxplot( aes(treatment, daily_shedding, fill = treatment), position = "dodge") + facet_wrap(~time_point) ``` ![](2-MotivatingExample_files/figure-html/unnamed-chunk-18-1.png)<!-- --> --- class: inverse ## Your Turn Try playing with chunks of code from this session to further investigate the *Salmonella* shedding data: 1. Get a summary of the daily shedding values (use the `shed` data set) 2. Make side by side boxplots of final weight gain by treatment group (use the `final_shed` data set) 3. Compute a wilcox test for control vs. the "Bglu" treatment group