- Kick things off by looking at a real data set in R
- Goal: get the flavor of using R for data management and exploration
- Don’t worry about understanding all the coding right away
- We will come back to see how everything works in detail
2022-05-16
Allison Horst: data on 344 penguins from three different species and three islands over three years
data consists of observations on nesting penguins
several variables recorded: species, island, bill length and depth (mm), flipper length (mm), body mass (g), sex, and year
Primary Question: Can sex be determined without a blood test?
Copy & paste the code into the console:
curl::curl_download(
  "https://raw.githubusercontent.com/heike/summerschool-2022/master/01-Introduction-to-R/code/1-example.R",
  "1-example.R"
)
file.edit("1-example.R")
Let’s use R to look at the top few rows of the data set. First, we load the penguins data with read.csv:
penguins <- read.csv("https://raw.githubusercontent.com/heike/summerschool-2022/main/01-Introduction-to-R/data/penguins.csv", stringsAsFactors = TRUE)
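The stringsAsFactors = TRUE argument is why the text columns show up as factors later on. A minimal sketch of the difference, using a tiny inline CSV whose rows are made up for illustration (they are not taken from the penguins file):

```r
# Tiny inline CSV (made-up rows, for illustration only)
csv_text <- "species,body_mass_g
Adelie,3750
Gentoo,5000
Adelie,3800"

# Same text read twice: once keeping strings, once converting to factors
as_text    <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_text$species)      # character column
class(as_factors$species)   # factor column
levels(as_factors$species)  # the distinct categories, sorted alphabetically
```

Factors are convenient here because summary() and ggplot2 treat them as categories rather than free text.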
Now, we use the head function to look at the first 6 rows of the data:
head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007
How big is this data set and what types of variables are in each column?
str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Let’s get a summary of the values for each variable in penguins:
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
##                                  Mean   :43.92   Mean   :17.15
##                                  3rd Qu.:48.50   3rd Qu.:18.70
##                                  Max.   :59.60   Max.   :21.50
##                                  NA's   :2       NA's   :2
##  flipper_length_mm  body_mass_g       sex           year
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
##  Median :197.0     Median :4050   NA's  : 11   Median :2008
##  Mean   :200.9     Mean   :4202                Mean   :2008
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
##  Max.   :231.0     Max.   :6300                Max.   :2009
##  NA's   :2         NA's   :2
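The summary reports missing values (NA's) column by column. One way to count them directly is sketched below. The toy data frame is made up for illustration and only mimics a few of the penguins columns.

```r
# Made-up rows that mimic a few penguins columns (illustration only)
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  body_mass_g    = c(3750, NA, 3250, 3450),
  sex            = factor(c("male", NA, "female", NA))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))
n_complete
```

Applied to the real data, colSums(is.na(penguins)) should agree with the NA's entries in the summary above: 2 for each measurement column and 11 for sex.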
First, we need to install and load ggplot2 (we also load dplyr, which provides the %>% pipe and filter):
# Installing is only needed once per machine
install.packages(c("ggplot2", "dplyr"))
library(ggplot2)
library(dplyr)
penguins %>% ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) + geom_point()
There seem to be some clusters
Correlation is below 0.70, but there is a positive linear relationship in some clusters
cor(penguins$bill_length_mm, penguins$flipper_length_mm, use = "pairwise")
## [1] 0.6561813
Color the points by the sex of the penguins: each of the clusters seems to be split about 50/50.
penguins %>% ggplot(aes(x = bill_length_mm, y = flipper_length_mm, colour = sex)) + geom_point()
Clusters are related to different species
penguins %>% ggplot(aes(x = bill_length_mm, y = flipper_length_mm, colour = species)) + geom_point()
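Since the clusters correspond to species, the correlation can also be computed within each species rather than over all penguins at once. A minimal base R sketch, using made-up measurements for two hypothetical groups "A" and "B" (not real penguin data):

```r
# Made-up measurements for two groups (illustration only)
toy <- data.frame(
  species           = rep(c("A", "B"), each = 4),
  bill_length_mm    = c(35, 37, 39, 41, 45, 47, 49, 51),
  flipper_length_mm = c(180, 185, 187, 192, 210, 214, 215, 221)
)

# Split the rows by species, then compute one correlation per piece
by_species <- split(toy, toy$species)
cor_by_species <- sapply(by_species, function(d) {
  cor(d$bill_length_mm, d$flipper_length_mm, use = "pairwise.complete.obs")
})
cor_by_species
```

The same pattern should work on the real data by replacing toy with penguins. The use argument drops rows where either measurement is missing.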
Artwork by @allison_horst
penguins %>% ggplot(aes(x = bill_length_mm, y = flipper_length_mm, colour = species)) + geom_point() + scale_color_manual(values = c("darkorange","purple","cyan4"))
Let’s look at the distribution of body mass with a histogram
penguins %>% ggplot(aes(x = body_mass_g)) + geom_histogram(binwidth = 100)
Let’s look at the distribution of body mass with a histogram colored by sex
penguins %>% ggplot(aes(x = body_mass_g, fill= sex)) + geom_histogram(binwidth = 100)
We can facet by species to compare the sexes within each species
penguins %>% ggplot(aes(x = body_mass_g, fill= sex)) + geom_histogram(binwidth = 100) + facet_grid(.~species)
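The counts behind a faceted, sex-colored plot can be tabulated directly. The sketch below cross-tabulates species and sex on a made-up toy data frame (illustration only), and then converts the counts to within-species proportions.

```r
# Made-up rows (illustration only)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  sex     = c("male", "female", NA, "male", "female", "female")
)

# Counts by species and sex, with a separate column for missing sex
counts <- table(toy$species, toy$sex, useNA = "ifany")
counts

# Share of each sex within a species (rows sum to 1; missing sex is dropped)
props <- prop.table(table(toy$species, toy$sex), margin = 1)
props
```

With the real data, table(penguins$species, penguins$sex) gives a numeric check on the impression that each species is split roughly evenly by sex.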
We can also compare density curves by sex, faceted by species, after dropping penguins with missing sex
penguins %>% filter(!is.na(sex)) %>% ggplot(aes(x = body_mass_g, fill= sex)) + geom_density(alpha = 0.5) + facet_grid(.~species)
Look at the average body mass for Gentoo penguins, separately for males and females:
gentoo <- filter(penguins, species == "Gentoo")
mean(gentoo$body_mass_g[gentoo$sex == "male"], na.rm = TRUE)
## [1] 5484.836
mean(gentoo$body_mass_g[gentoo$sex == "female"], na.rm=TRUE)
## [1] 4679.741
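The two group means can also be computed in a single call. One option, sketched here with base R's tapply on a made-up data frame (the values are for illustration only, not real Gentoo measurements):

```r
# Made-up rows (illustration only)
toy_gentoo <- data.frame(
  sex         = factor(c("male", "female", "male", "female", NA)),
  body_mass_g = c(5500, 4700, 5400, NA, 5000)
)

# Apply mean() to body mass within each level of sex;
# rows with missing sex are left out, and na.rm = TRUE skips missing masses
mean_by_sex <- tapply(toy_gentoo$body_mass_g, toy_gentoo$sex, mean, na.rm = TRUE)
mean_by_sex
```

On the real data, tapply(gentoo$body_mass_g, gentoo$sex, mean, na.rm = TRUE) should reproduce the two means shown above.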
There is a difference, but is it statistically significant?
t.test(body_mass_g ~ sex, data = gentoo)
##
##  Welch Two Sample t-test
##
## data:  body_mass_g by sex
## t = -14.761, df = 116.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -913.1130 -697.0763
## sample estimates:
## mean in group female   mean in group male
##             4679.741             5484.836
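The printed result is stored as an object, so individual pieces such as the p-value or the confidence interval can be pulled out by name. A minimal sketch on made-up body masses (illustration only):

```r
# Made-up body masses for two groups (illustration only)
toy <- data.frame(
  sex         = rep(c("female", "male"), each = 5),
  body_mass_g = c(4600, 4700, 4650, 4750, 4700,
                  5400, 5500, 5450, 5550, 5600)
)

# Welch two-sample t-test, same formula interface as above
res <- t.test(body_mass_g ~ sex, data = toy)

res$p.value    # p-value of the test
res$conf.int   # confidence interval for the difference (female minus male)
res$estimate   # the two group means
```

A confidence interval that lies entirely below zero, as in the Gentoo output above, indicates that females are lighter than males on average.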
Try playing with chunks of code from this session to further investigate the penguins data:
- What are the median bill depths for Adelie and Gentoo penguins? (Make sure to use na.rm = TRUE to exclude missing values.)