2022-05-16

Data management in R: the tidyverse

Outline

  • elements of data management:

    • filtering,
    • sorting, and
    • aggregations
  • lots of examples

tidyverse

tidyverse is a package bundling several other R packages:

  • ggplot2, dplyr, tidyr, purrr, …

  • share common data representations and API, i.e. work well together

  • from the tidyverse manifesto:

  1. Reuse existing data structures.

  2. Compose simple functions with the pipe.

  3. Embrace functional programming.

  4. Design for humans.

dplyr

There are a couple of primary dplyr verbs, representing distinct data analysis tasks:

  • filter: Select specified rows of a data frame, produce subsets

  • mutate: Add new or change existing columns of the data frame (as functions of existing columns)

  • arrange: Reorder the rows of a data frame

  • select: Select particular columns of a data frame

  • summarize: Create collapsed summaries of a data frame

  • group_by: Introduce structure to a data frame

Common structure

all functions of the tidyverse have data as their first element

Important: do not use $ notation for variables within these functions, e.g:

ggplot(data = fbi, aes(x = Year, y = Count)) + 
  geom_point()
filter(fbi, Year >= 2017, State == "Iowa")

filter

select a subset of the observations (horizontal selection):

filter (.data, ...)

specify constraints (as logical expression) to data in ...

all constraints are combined by logical and &

Make sure to always call library(dplyr) before using filter


filter Example

From the fbi data, extract all burglaries in 2016:

library(classdata)
library(dplyr)

fbi %>% filter(Type=="Burglary", Year==2016) %>% head()
##        State Abb Year Population     Type  Count Violent.crime
## 1    Alabama  AL 2016    4863300 Burglary  34065         FALSE
## 2     Alaska  AK 2016     741894 Burglary   4053         FALSE
## 3    Arizona  AZ 2016    6931071 Burglary  37736         FALSE
## 4   Arkansas  AR 2016    2988248 Burglary  23771         FALSE
## 5 California  CA 2016   39250017 Burglary 188304         FALSE
## 6   Colorado  CO 2016    5540545 Burglary  23903         FALSE

mutate

mutate (.data, ...)

Introduce new variables into the data set or transform/update old variables

multiple variables can be changed/introduced

mutate works sequentially: variables introduced become available in following changes


mutate Example

Introduce a variable Rate into the fbi data:

fbi %>% mutate(Rate = Count/Population*70000) %>% head()
##     State Abb Year Population                                 Type Count
## 1 Alabama  AL 1961    3302000 Murder.and.nonnegligent.Manslaughter   427
## 2 Alabama  AL 1962    3358000 Murder.and.nonnegligent.Manslaughter   316
## 3 Alabama  AL 1963    3347000 Murder.and.nonnegligent.Manslaughter   340
## 4 Alabama  AL 1964    3407000 Murder.and.nonnegligent.Manslaughter   316
## 5 Alabama  AL 1965    3462000 Murder.and.nonnegligent.Manslaughter   395
## 6 Alabama  AL 1966    3517000 Murder.and.nonnegligent.Manslaughter   384
##   Violent.crime     Rate
## 1          TRUE 9.052090
## 2          TRUE 6.587254
## 3          TRUE 7.110846
## 4          TRUE 6.492515
## 5          TRUE 7.986713
## 6          TRUE 7.642877

arrange

arrange sorts a data set by the values in one or more variables

Successive variables break ties in previous ones

desc stands for descending, otherwise rows are sorted from smallest to largest

fbi %>% arrange(desc(Year), Type, desc(Count)) %>% head()
##        State Abb Year Population               Type  Count Violent.crime
## 1 California  CA 2018   39557045 Aggravated.assault 105412          TRUE
## 2      Texas  TX 2018   28701845 Aggravated.assault  73656          TRUE
## 3    Florida  FL 2018   21299325 Aggravated.assault  55551          TRUE
## 4   New York  NY 2018   19542209 Aggravated.assault  43171          TRUE
## 5  Tennessee  TN 2018    6770010 Aggravated.assault  31717          TRUE
## 6   Michigan  MI 2018    9995915 Aggravated.assault  31021          TRUE

select

Select specific variables of a data frame (vertical selection):

select (.data, ...)

specify all variables you want to keep

Variables can be selected by index, e.g. 1:5, by name (don’t use quotes), or using a selector function, such as starts_with

Negative selection also works, e.g. -1 (not the first variable)


select Example

Select Type, Count, State, and Year from the fbi data:

fbi %>% arrange(desc(Year), Type, desc(Count)) %>%
  select(Type, Count, State, Year) %>% head()
##                 Type  Count      State Year
## 1 Aggravated.assault 105412 California 2018
## 2 Aggravated.assault  73656      Texas 2018
## 3 Aggravated.assault  55551    Florida 2018
## 4 Aggravated.assault  43171   New York 2018
## 5 Aggravated.assault  31717  Tennessee 2018
## 6 Aggravated.assault  31021   Michigan 2018

Your turn

Use the fbiwide data set from the classdata package

Write out at least three different ways of selecting all variables describing incidences of different types of crimes

summarize

summarize (.data, ...)

summarize observations into a (set of) one-number statistic(s):

Creates a new dataset with 1 row and one column for each of the summary statistics


summarise Example

Calculate the mean and standard deviation of Crime rates in the fbi data

fbi %>% 
    summarise(mean_rate = mean(Count/Population*70000, na.rm=TRUE), 
              sd_rate = sd(Count/Population*70000, na.rm = TRUE))
##   mean_rate sd_rate
## 1  395.5483 609.999

summarize and group_by

Power combo!

for each combination of group levels, create one row of a (set of) one-number statistic(s)

The new dataset has one column for each of the summary statistics, and one row for each combination of grouping levels (multiplicative)


summarise and group_by

For each type of crime, calculate average crime rate and standard deviation.

fbi %>%
    group_by(Type) %>%
    summarise(mean_rate = mean(Count/Population*70000, na.rm=TRUE), 
              sd_rate = sd(Count/Population*70000, na.rm = TRUE))
## # A tibble: 8 × 3
##   Type                                 mean_rate sd_rate
##   <fct>                                    <dbl>   <dbl>
## 1 Aggravated.assault                      167.    112.  
## 2 Burglary                                623.    312.  
## 3 Larceny.theft                          1648.    626.  
## 4 Legacy.rape                              20.2    10.8 
## 5 Motor.vehicle.theft                     243.    157.  
## 6 Murder.and.nonnegligent.Manslaughter      4.58    4.22
## 7 Rape                                     30.1    12.9 
## 8 Robbery                                  87.8   102.

Let’s use these tools