2022-05-16

Motivating Example

  • Kick things off by looking at a real data set in R
  • Goal: get the flavor of using R for data management and exploration
  • Don’t worry about understanding all the coding right away
  • We will come back to see how everything works in detail

Palmer Penguins

  • Allison Horst: data on 344 penguins from three different species and three islands over three years

  • https://allisonhorst.github.io/palmerpenguins/

  • data consists of observations on nesting penguins

  • several variables recorded:

    • body measures of penguins: bill length, bill depth, flipper length, and body mass
    • year
    • species
    • location
    • penguin’s sex (determined by blood test)
  • Primary Question: Can sex be determined without a blood test?

Follow along

Copy & paste the code into the console:

curl::curl_download(
  "https://raw.githubusercontent.com/heike/summerschool-2022/master/01-Introduction-to-R/code/1-example.R",
  "1-example.R"
)
file.edit("1-example.R")

First look at data in R

Let’s use R to look at the top few rows of the data set. First, we load the penguins data with read.csv:

penguins <- read.csv("https://raw.githubusercontent.com/heike/summerschool-2022/main/01-Introduction-to-R/data/penguins.csv", stringsAsFactors = TRUE)

Now, we use the head function to look at the first 6 rows of the data:

head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007

Penguins Data Attributes

How big is this data set and what types of variables are in each column?

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
  • Python has pandas … R has data.frames
  • data.frames are lists of variables
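To make "data.frames are lists of variables" concrete, here is a minimal sketch. It rebuilds three columns from the first three rows of the head(penguins) output above; the name first3 is made up for this illustration.

```r
# Three columns from the first three rows of head(penguins) shown earlier
first3 <- data.frame(
  species        = factor(c("Adelie", "Adelie", "Adelie")),
  bill_length_mm = c(39.1, 39.5, 40.3),
  body_mass_g    = c(3750L, 3800L, 3250L)
)

is.list(first3)          # a data.frame is a list under the hood
length(first3)           # number of list elements = number of columns
names(first3)            # element names = column names
first3$bill_length_mm    # `$` pulls out one variable as a vector
first3[["body_mass_g"]]  # `[[ ]]` does the same, selecting by name
```

Each column is one list element, which is why `$` works on data.frames the same way it works on ordinary lists.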

Penguins Variables

Let’s get a summary of the values for each variable in penguins:

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
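The NA's rows in the summary count missing values. A small sketch of how one might count them directly, and why na.rm = TRUE matters later on, using three columns from the six rows of head(penguins) shown earlier (the name first6 is made up for this illustration):

```r
# Three columns from the six rows of head(penguins); row 4 is missing everywhere
first6 <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  body_mass_g    = c(3750, 3800, 3250, NA, 3450, 3650),
  sex            = factor(c("male", "female", "female", NA, "female", "male"))
)

# is.na() marks every missing cell; colSums() counts them per column
na_counts <- colSums(is.na(first6))
na_counts

# A single NA makes the mean NA unless missing values are removed
mean(first6$body_mass_g)
mean(first6$body_mass_g, na.rm = TRUE)
```

Running the same colSums(is.na(...)) call on the full penguins data should reproduce the NA's counts from the summary.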

How do penguins’ body measurements relate to each other?

Scatterplots: bill length versus flipper length

First, we need to install and load ggplot2 for plotting and dplyr for the %>% pipe and filter():

install.packages(c("ggplot2", "dplyr"))
library(ggplot2)
library(dplyr)
penguins %>% ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) + geom_point()

There seem to be some clusters

The overall correlation is below 0.70, but there is a positive linear relationship within some clusters

cor(penguins$bill_length_mm, penguins$flipper_length_mm, use = "pairwise")
## [1] 0.6561813
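The use argument tells cor how to handle missing values; "pairwise" is an accepted abbreviation of "pairwise.complete.obs". A minimal sketch of what it does, using the bill and flipper lengths from the six rows of head(penguins) shown earlier (the vector names are made up for this illustration):

```r
# Bill and flipper lengths from head(penguins); the fourth penguin is missing both
bill    <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3)
flipper <- c(181, 186, 195, NA, 193, 190)

# By default, any missing value makes the correlation NA
cor(bill, flipper)

# "pairwise.complete.obs" keeps only the pairs where both values are present
r_pairwise <- cor(bill, flipper, use = "pairwise.complete.obs")

# The same result by dropping incomplete pairs by hand
keep     <- complete.cases(bill, flipper)
r_manual <- cor(bill[keep], flipper[keep])

all.equal(r_pairwise, r_manual)
```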

Color points

Color the points by the sex of the penguins: each of the clusters seems to be split about 50/50.

penguins %>% 
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm, colour = sex)) + 
  geom_point()

Closer look

Clusters are related to different species

penguins %>% 
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm, colour = species)) + 
  geom_point()

Meet the Penguins!

Artwork by @allison_horst

Clusters of penguins

penguins %>% 
  ggplot(aes(x = bill_length_mm, y = flipper_length_mm, colour = species)) + 
  geom_point() +
  scale_color_manual(values = c("darkorange","purple","cyan4")) 

Back to trying to determine sex

Let’s look at the distribution of body mass with a histogram

penguins %>% 
  ggplot(aes(x = body_mass_g)) + geom_histogram(binwidth = 100)

Back to trying to determine sex of penguins

Let’s look at the distribution of body mass with a histogram colored by sex

penguins %>% 
  ggplot(aes(x = body_mass_g, fill= sex)) + geom_histogram(binwidth = 100)

Body mass and sex of penguins

We can use faceting by species

penguins %>% 
  ggplot(aes(x = body_mass_g, fill= sex)) + geom_histogram(binwidth = 100) +
  facet_grid(.~species)

Body mass and sex of penguins

We can also drop the penguins with missing sex and compare density curves, again faceting by species

penguins %>% filter(!is.na(sex)) %>%
  ggplot(aes(x = body_mass_g, fill= sex)) + geom_density(alpha = 0.5) +
  facet_grid(.~species)

Look at the average body mass for Gentoo penguins separately:

gentoo <- filter(penguins, species=="Gentoo")

mean(gentoo$body_mass_g[gentoo$sex == "male"], na.rm=TRUE)
## [1] 5484.836
mean(gentoo$body_mass_g[gentoo$sex == "female"], na.rm=TRUE)
## [1] 4679.741
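The two mean() calls can also be collapsed into one grouped computation. A minimal base-R sketch with tapply, shown here on the six Adelie rows from head(penguins) rather than on the Gentoo data (the name first6 is made up for this illustration):

```r
# Body mass and sex from the six rows of head(penguins) shown earlier
first6 <- data.frame(
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650),
  sex         = factor(c("male", "female", "female", NA, "female", "male"))
)

# tapply() splits body_mass_g by sex and applies mean() to each group;
# na.rm = TRUE is passed on to mean(), and rows with missing sex are dropped
mass_by_sex <- tapply(first6$body_mass_g, first6$sex, mean, na.rm = TRUE)
mass_by_sex
```

Replacing first6 with gentoo should give the same two means as the calls above, in a single named result.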

Statistical Significance

There is a difference, but is it statistically significant?

t.test(body_mass_g ~ sex, data = gentoo)
## 
##  Welch Two Sample t-test
## 
## data:  body_mass_g by sex
## t = -14.761, df = 116.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -913.1130 -697.0763
## sample estimates:
## mean in group female   mean in group male 
##             4679.741             5484.836
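The printed output is convenient to read, but t.test also returns an object whose pieces can be pulled out by name. A sketch of the mechanics only, run on the six rows from head(penguins); with so few penguins the test itself is not meaningful, and the names first6 and result are made up for this illustration.

```r
# Body mass and sex from the six rows of head(penguins) shown earlier
first6 <- data.frame(
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650),
  sex         = factor(c("male", "female", "female", NA, "female", "male"))
)

# Same call as above; rows with missing values are dropped by the formula interface
result <- t.test(body_mass_g ~ sex, data = first6)

result$statistic  # the t statistic
result$parameter  # the (Welch) degrees of freedom
result$p.value    # the p-value as a plain number
result$conf.int   # 95% confidence interval for the difference in means
result$estimate   # the two group means
```

Storing the result this way makes it easy to reuse the p-value or the interval in later code instead of copying numbers from the console.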

Your Turn

Try playing with chunks of code from this session to further investigate the penguins data:

  1. What is the relationship between penguins’ bill length and bill depth? What is the correlation? Would you expect the same correlation for all species?
  2. What are the median bill depths for Adelie and Gentoo penguins? (Use median, and make sure to set na.rm = TRUE to exclude missing values.)
  3. Does body mass discriminate between male and female Adelie penguins as well as it does for Gentoo penguins?