Kick off the workshop by exploring a real data set using R!
Goal: get the flavor of using R for data management and exploration
Kick off the workshop by exploring a real data set using R!
Goal: get the flavor of using R for data management and exploration
Don't worry about understanding all the coding right away
We will go back and explain how it all works in detail
Follow along using 2-MotivatingExample.R
Let's begin by installing and loading tidyverse
:
install.packages("tidyverse")library(tidyverse)
This workshop will focus largely on a group of packages that live together under the name tidyverse
.
tidyverse
includes the following well known packages:
ggplot2
dplyr
tidyr
readr
Fecal Salmonella shedding recorded over the course of 21 days
Several variables were recorded for each pig at certain time points:
Primary Question: Do the different dietary treatments affect Salmonella shedding?
Lets use R to look at the top few rows of the Salmonella shedding data set. First, we load tip using read_csv
:
shed <- read_csv("http://heike.github.io/rwrks/01-r-intro/data/daily_shedding.csv")
Now, we use the head
function to look at the first 6 rows of the data:
head(shed)
## # A tibble: 6 x 12## pignum time_point pig_weight temp pan_wt wet_wt Dry_wt_pan Dry_weight## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>## 1 77 2 NA 103 1.02 1.09 1.31 0.290 ## 2 87 2 NA 104 1.04 1.02 1.33 0.290 ## 3 122 2 NA 104 1.02 0.990 1.11 0.0900## 4 160 2 NA 104 1.09 1.00 1.36 0.270 ## 5 191 2 NA 104 1.03 1.02 1.27 0.240 ## 6 224 2 NA 103 1.08 1.00 1.39 0.310 ## # ... with 4 more variables: percent_drymatter <dbl>,## # daily_shedding <dbl>, treatment <chr>, total_shedding <dbl>
How big is this data set and what types of variables are in each column?
str(shed)
## Classes 'tbl_df', 'tbl' and 'data.frame': 295 obs. of 12 variables:## $ pignum : int 77 87 122 160 191 224 337 345 419 458 ...## $ time_point : int 2 2 2 2 2 2 2 2 2 2 ...## $ pig_weight : num NA NA NA NA NA NA NA NA NA NA ...## $ temp : num 103 104 104 104 104 ...## $ pan_wt : num 1.02 1.04 1.02 1.09 1.03 1.08 1.03 1.09 1.07 1.04 ...## $ wet_wt : num 1.09 1.02 0.99 1 1.02 1 1.02 0.99 1.07 1.06 ...## $ Dry_wt_pan : num 1.31 1.33 1.11 1.36 1.27 1.39 1.3 1.24 1.38 1.26 ...## $ Dry_weight : num 0.29 0.29 0.09 0.27 0.24 0.31 0.27 0.15 0.31 0.22 ...## $ percent_drymatter: num 26.61 28.43 9.09 27 23.53 ...## $ daily_shedding : num 5.3 5.7 13.22 6.75 5.86 ...## $ treatment : chr "control" "control" "control" "control" ...## $ total_shedding : num 2.3 2.48 5.74 2.93 2.54 ...## - attr(*, "spec")=List of 2## ..$ cols :List of 12## .. ..$ pignum : list()## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"## .. ..$ time_point : list()## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"## .. ..$ pig_weight : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ temp : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ pan_wt : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ wet_wt : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ Dry_wt_pan : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ Dry_weight : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ percent_drymatter: list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ daily_shedding : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## .. ..$ treatment : list()## .. .. ..- attr(*, "class")= chr "collector_character" "collector"## .. ..$ total_shedding : list()## .. .. ..- attr(*, "class")= chr "collector_double" "collector"## ..$ default: list()## .. ..- attr(*, "class")= chr "collector_guess" "collector"## ..- attr(*, "class")= chr "col_spec"
Let's get a summary of the values for each variable:
summary(shed)
## pignum time_point pig_weight temp ## Min. : 6.0 Min. : 0.0 Min. : 9.48 Min. :101.2 ## 1st Qu.:119.0 1st Qu.: 2.0 1st Qu.:16.02 1st Qu.:102.4 ## Median :224.0 Median : 7.0 Median :19.78 Median :102.9 ## Mean :234.7 Mean : 8.8 Mean :21.01 Mean :102.9 ## 3rd Qu.:361.0 3rd Qu.:14.0 3rd Qu.:25.29 3rd Qu.:103.4 ## Max. :472.0 Max. :21.0 Max. :36.30 Max. :105.2 ## NA's :59 NA's :118 ## pan_wt wet_wt Dry_wt_pan Dry_weight ## Min. :1.000 Min. :0.700 Min. :1.110 Min. :0.090 ## 1st Qu.:1.030 1st Qu.:1.000 1st Qu.:1.300 1st Qu.:0.250 ## Median :1.050 Median :1.020 Median :1.320 Median :0.280 ## Mean :1.054 Mean :1.026 Mean :1.322 Mean :0.268 ## 3rd Qu.:1.070 3rd Qu.:1.050 3rd Qu.:1.353 3rd Qu.:0.290 ## Max. :1.120 Max. :1.110 Max. :1.460 Max. :0.380 ## NA's :122 NA's :123 NA's :123 NA's :123 ## percent_drymatter daily_shedding treatment total_shedding ## Min. : 9.091 Min. : 0.000 Length:295 Min. : 0.000 ## 1st Qu.:24.292 1st Qu.: 0.000 Class :character 1st Qu.: 1.699 ## Median :26.923 Median : 3.912 Mode :character Median :13.904 ## Mean :26.109 Mean : 3.765 Mean :17.423 ## 3rd Qu.:28.431 3rd Qu.: 5.521 3rd Qu.:30.893 ## Max. :36.893 Max. :13.218 Max. :66.207 ## NA's :123
Lets look at the relationship between time point and Salmonella shedding.
ggplot(shed, aes(x=time_point, y=total_shedding)) + geom_point()
Color the points by treatment groups
ggplot(shed, aes(x=time_point, y=total_shedding, colour = treatment)) + geom_point()
Switch to lines for easier reading
ggplot(shed, aes(x=time_point, y=total_shedding, colour = treatment)) + geom_line(aes(group=pignum))
Add facetting
ggplot(shed, aes(x=time_point, y=total_shedding, color=treatment)) + geom_line(aes(group=pignum)) + facet_wrap(~treatment)
We will make a new variable in the data: gain = weight at day 21 - weight at day 0
.
We will then filter the data to only include the final values.
final_shed <- shed %>% group_by(pignum) %>% mutate(gain = pig_weight[time_point == 21] - pig_weight[time_point == 0]) %>% filter(time_point == 21) %>% ungroup() %>% select(-c(4:9))summary(final_shed$gain)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 8.22 12.40 14.14 14.16 16.04 18.86
Lets look distribution of the final shedding values with a histogram
ggplot(final_shed) + geom_histogram(aes(x = total_shedding))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
which pig and what treatment did it receive?
final_shed[which.min(final_shed$total_shedding),]
## # A tibble: 1 x 7## pignum time_point pig_weight daily_shedding treatment total_shedding## <int> <int> <dbl> <dbl> <chr> <dbl>## 1 97 21 24.9 0 RPS 0## # ... with 1 more variable: gain <dbl>
Look at the average cumulative shedding value for each treatment seperately
# Using base `R`:median(final_shed$total_shedding[final_shed$treatment == "control"])
## [1] 44.42518
# then repeat for each treatment. Or ...
# Using `tidyverse`:final_shed %>% group_by(treatment) %>% summarise(med_shed = median(total_shedding))
## # A tibble: 6 x 2## treatment med_shed## <chr> <dbl>## 1 Acid 35.4## 2 Bglu 37.6## 3 control 44.4## 4 RCS 39.4## 5 RPS 29.1## 6 Zn+Cu 44.1
]
There is a difference but is it statistically significant?
wilcox.test(total_shedding ~ treatment, data = final_shed, subset = treatment %in% c("control", "RPS"))
## ## Wilcoxon rank sum test## ## data: total_shedding by treatment## W = 75, p-value = 0.01327## alternative hypothesis: true location shift is not equal to 0
We could compare the cumulative shedding values of each treatment group with a side by side boxplot
ggplot(final_shed) + geom_boxplot( aes(treatment, total_shedding, fill = treatment))
Alternatively, we could use the original data to compare the daily shedding values of each treatment group with a side by side boxplot over time
shed %>% filter(time_point != 0) %>% ggplot() + geom_boxplot( aes(treatment, daily_shedding, fill = treatment), position = "dodge") + facet_wrap(~time_point)
Try playing with chunks of code from this session to further investigate the Salmonella shedding data:
Get a summary of the daily shedding values (use the shed
data set)
Make side by side boxplots of final weight gain by treatment group (use the final_shed
data set)
Compute a wilcox test for control vs. the "Bglu" treatment group
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |