2022-05-16
elements of data management:
lots of examples
tidyverse
tidyverse
is a package bundling several other R packages:
ggplot2
, dplyr
, tidyr
, purrr
, …
share common data representations and API, i.e. work well together
from the tidyverse manifesto:
Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.
dplyr
There are a couple of primary dplyr
verbs, representing distinct data analysis tasks:
filter
: Select specified rows of a data frame, produce subsets
mutate
: Add new or change existing columns of the data frame (as functions of existing columns)
arrange
: Reorder the rows of a data frame
select
: Select particular columns of a data frame
summarize
: Create collapsed summaries of a data frame
group_by
: Introduce structure to a data frame
all functions of the tidyverse have data
as their first element
Important: do not use $
notation for variables within these functions, e.g:
ggplot(data = fbi, aes(x = Year, y = Count)) + geom_point()
filter(fbi, Year >= 2017, State == "Iowa")
filter
select a subset of the observations (horizontal selection):
filter (.data, ...)
specify constraints (as logical expression) to data in ...
all constraints are combined by logical and &
Make sure to always call library(dplyr)
before using filter
filter
ExampleFrom the fbi
data, extract all burglaries in 2016:
library(classdata) library(dplyr) fbi %>% filter(Type=="Burglary", Year==2016) %>% head()
## State Abb Year Population Type Count Violent.crime ## 1 Alabama AL 2016 4863300 Burglary 34065 FALSE ## 2 Alaska AK 2016 741894 Burglary 4053 FALSE ## 3 Arizona AZ 2016 6931071 Burglary 37736 FALSE ## 4 Arkansas AR 2016 2988248 Burglary 23771 FALSE ## 5 California CA 2016 39250017 Burglary 188304 FALSE ## 6 Colorado CO 2016 5540545 Burglary 23903 FALSE
mutate
mutate (.data, ...)
Introduce new variables into the data set or transform/update old variables
multiple variables can be changed/introduced
mutate
works sequentially: variables introduced become available in following changes
mutate
ExampleIntroduce a variable Rate
into the fbi
data:
fbi %>% mutate(Rate = Count/Population*70000) %>% head()
## State Abb Year Population Type Count ## 1 Alabama AL 1961 3302000 Murder.and.nonnegligent.Manslaughter 427 ## 2 Alabama AL 1962 3358000 Murder.and.nonnegligent.Manslaughter 316 ## 3 Alabama AL 1963 3347000 Murder.and.nonnegligent.Manslaughter 340 ## 4 Alabama AL 1964 3407000 Murder.and.nonnegligent.Manslaughter 316 ## 5 Alabama AL 1965 3462000 Murder.and.nonnegligent.Manslaughter 395 ## 6 Alabama AL 1966 3517000 Murder.and.nonnegligent.Manslaughter 384 ## Violent.crime Rate ## 1 TRUE 9.052090 ## 2 TRUE 6.587254 ## 3 TRUE 7.110846 ## 4 TRUE 6.492515 ## 5 TRUE 7.986713 ## 6 TRUE 7.642877
arrange
arrange
sorts a data set by the values in one or more variables
Successive variables break ties in previous ones
desc
stands for descending, otherwise rows are sorted from smallest to largest
fbi %>% arrange(desc(Year), Type, desc(Count)) %>% head()
## State Abb Year Population Type Count Violent.crime ## 1 California CA 2018 39557045 Aggravated.assault 105412 TRUE ## 2 Texas TX 2018 28701845 Aggravated.assault 73656 TRUE ## 3 Florida FL 2018 21299325 Aggravated.assault 55551 TRUE ## 4 New York NY 2018 19542209 Aggravated.assault 43171 TRUE ## 5 Tennessee TN 2018 6770010 Aggravated.assault 31717 TRUE ## 6 Michigan MI 2018 9995915 Aggravated.assault 31021 TRUE
select
Select specific variables of a data frame (vertical selection):
select (.data, ...)
specify all variables you want to keep
Variables can be selected by index, e.g. 1:5
, by name (don’t use quotes), or using a selector function, such as starts_with
Negative selection also works, e.g. -1
(not the first variable)
select
ExampleSelect Type, Count, State
, and Year
from the fbi
data:
fbi %>% arrange(desc(Year), Type, desc(Count)) %>% select(Type, Count, State, Year) %>% head()
## Type Count State Year ## 1 Aggravated.assault 105412 California 2018 ## 2 Aggravated.assault 73656 Texas 2018 ## 3 Aggravated.assault 55551 Florida 2018 ## 4 Aggravated.assault 43171 New York 2018 ## 5 Aggravated.assault 31717 Tennessee 2018 ## 6 Aggravated.assault 31021 Michigan 2018
Use the fbiwide
data set from the classdata
package
Write out at least three different ways of selecting all variables describing incidences of different types of crimes
summarize
summarize (.data, ...)
summarize observations into a (set of) one-number statistic(s):
Creates a new dataset with 1 row and one column for each of the summary statistics
summarise
ExampleCalculate the mean and standard deviation of Crime rates in the fbi
data
fbi %>% summarise(mean_rate = mean(Count/Population*70000, na.rm=TRUE), sd_rate = sd(Count/Population*70000, na.rm = TRUE))
## mean_rate sd_rate ## 1 395.5483 609.999
summarize
and group_by
Power combo!
for each combination of group levels, create one row of a (set of) one-number statistic(s)
The new dataset has one column for each of the summary statistics, and one row for each combination of grouping levels (multiplicative)
summarise
and group_by
For each type of crime, calculate average crime rate and standard deviation.
fbi %>% group_by(Type) %>% summarise(mean_rate = mean(Count/Population*70000, na.rm=TRUE), sd_rate = sd(Count/Population*70000, na.rm = TRUE))
## # A tibble: 8 × 3 ## Type mean_rate sd_rate ## <fct> <dbl> <dbl> ## 1 Aggravated.assault 167. 112. ## 2 Burglary 623. 312. ## 3 Larceny.theft 1648. 626. ## 4 Legacy.rape 20.2 10.8 ## 5 Motor.vehicle.theft 243. 157. ## 6 Murder.and.nonnegligent.Manslaughter 4.58 4.22 ## 7 Rape 30.1 12.9 ## 8 Robbery 87.8 102.