2022-05-16

R is an Overgrown Calculator

  • Follow along (copy & paste the code into the console):
curl::curl_download(
  "https://raw.githubusercontent.com/heike/summerschool-2022/master/01-Introduction-to-R/code/2-basics.R",
  "2-basics.R"
)
file.edit("2-basics.R")

R is an Overgrown Calculator

# Addition and Subtraction
2 + 5 - 1
## [1] 6
# Multiplication
109*23452
## [1] 2556268
# Division
3/7
## [1] 0.4285714

More Calculator Operations

# Integer division
7 %/% 2
## [1] 3
# Modulo operator (Remainder)
7 %% 2
## [1] 1
# Powers
1.5^3
## [1] 3.375
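
Integer division and the modulo operator fit together: the quotient times the divisor plus the remainder gives back the original number. A small sketch, with a and b as illustrative names that do not come from the slides:

```r
a <- 7
b <- 2
quotient  <- a %/% b   # 3
remainder <- a %% b    # 1
quotient * b + remainder == a   # TRUE

# With a negative dividend, %/% rounds down (toward minus infinity),
# so the remainder takes the sign of the divisor.
(-7) %/% 2   # -4
(-7) %% 2    # 1
```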

Math Functions

  • Exponentiation
    • exp(x)
  • Logarithms
    • log(x)
    • log(x, base = 10)
  • Trigonometric functions
    • sin(x)
    • asin(x)
    • cos(x)
    • tan(x)
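
A few of the functions listed above, tried on concrete values (the value of x is picked only for illustration):

```r
x <- 100
exp(1)              # Euler's number, about 2.718282
log(x)              # natural logarithm (base e)
log(x, base = 10)   # 2
log(exp(3))         # 3: log() undoes exp()
sin(pi / 2)         # 1; angles are given in radians
asin(1)             # pi / 2, about 1.570796
```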

Creating Objects

We create an object using the assignment operator <-

x <- 5
y <- 21

We can then perform operations on these objects:

log(x)
## [1] 1.609438
y^2
## [1] 441

Rules for Variable Creation

  • Variable names can’t start with a number
  • Variables in R are case-sensitive
  • Some common letters are used internally by R and should be avoided as variable names (c, q, t, C, D, F, T, I)
  • There are reserved words that R won’t let you use for variable names (for, in, while, if, else, repeat, break, next).
  • R will let you use the name of a predefined function without any warning.

Pro-tip: before introducing a new object, type its name in the console to check that it is not yet taken.
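
A short sketch of these rules in action. The names total, Total and my_new_name are made up for illustration:

```r
# Names are case-sensitive: these are two different objects.
total <- 10
Total <- 20
total == Total          # FALSE

# Check whether a name is already taken before reusing it.
exists("mean")          # TRUE: mean is a built-in function
exists("my_new_name")   # FALSE unless you have defined it yourself

# R does not complain when a name matches a built-in function.
c <- 3                  # no warning
c(1, 2)                 # still works: R looks for a function when a name is called
rm(c)                   # remove our object again to avoid confusion
```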

Vectors

A variable usually consists of more than a single value. We create a vector using the c (combine) function:

y <- c(1, 5, 3, 2)

Operations will then be done element-wise:

y / 2
## [1] 0.5 2.5 1.5 1.0
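
Element-wise evaluation also applies when both operands are vectors, and to the math functions. In this sketch z is a made-up second vector:

```r
y <- c(1, 5, 3, 2)
z <- c(10, 20, 30, 40)
y + z    # 11 25 33 42
y * z    # 10 100 90 80
y^2      # 1 25 9 4
sqrt(y)  # functions are applied to each element too
```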

Getting Help

We will talk MUCH more about vectors in a bit, but for now, let’s talk about a couple of ways to get help. The primary function to use is the help function. Just pass in the name of the function you need help with:

help(head)

The ? operator also works:

?head

Googling for help is a bit hard. Searches of the form R + CRAN + <topic> usually get good results.
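
When you do not know the name of the function yet, a few other base R tools can help. This is only a sketch; the search terms are arbitrary examples:

```r
# Search the installed help pages by keyword.
help.search("standard deviation")

# List objects whose names match a pattern.
apropos("^read\\.csv")

# Run the examples at the bottom of a help page.
example(mean)
```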

R Reference Card

Your Turn

Using the R Reference Card at https://cran.r-project.org/doc/contrib/Short-refcard.pdf (and the Help pages, if needed), do the following:

  1. Find out how many rows and columns the iris data set has. Figure out at least 2 ways to do this.

Hint: “Variable Information” section on the first page of the reference card!

  2. Use the rep function to construct the following vector: 1 1 2 2 3 3 4 4 5 5

Hint: “Data Creation” section of the reference card

Give this vector the name x

  3. Square each element in the vector x, then calculate the average value.

Data Frames Introduction

  • penguins is a data frame.
  • Data frames hold data sets
  • Columns can have different types - like an Excel spreadsheet
  • Each column in a data frame is a vector - so each column needs to have values that are all the same type.
  • We can access different columns using the $ operator.
penguins <- read.csv("https://raw.githubusercontent.com/heike/summerschool-2022/main/01-Introduction-to-R/data/penguins.csv", stringsAsFactors = TRUE)

species <- penguins$species
bill_length <- penguins$bill_length_mm
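
The same ideas can be tried without a download on iris, a data frame that ships with R. The object name sepal_length is chosen here for illustration:

```r
class(iris)                 # "data.frame"
names(iris)                 # the column names
sepal_length <- iris$Sepal.Length
class(sepal_length)         # "numeric"
class(iris$Species)         # "factor": a different type in the same data frame
head(sepal_length, 3)       # 5.1 4.9 4.7
```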

More about Vectors

A vector is a sequence of values that are all the same type. We have seen that we can create them using the c or the rep function. We can also use the : operator if we wish to create consecutive values:

a <- 10:15
a
## [1] 10 11 12 13 14 15

We can extract individual elements of the vector like so (note that, unlike in Python, indexing in R starts at 1):

bill_length[3]
## [1] 40.3

Indexing Vectors

We have seen that we can access individual elements of the vector. But indexing is a lot more powerful than that:

head(bill_length)
## [1] 39.1 39.5 40.3   NA 36.7 39.3
bill_length[c(1, 3, 5)]
## [1] 39.1 40.3 36.7
bill_length[1:5]
## [1] 39.1 39.5 40.3   NA 36.7
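
A few more indexing patterns, sketched on the small vector a from earlier so that every result is easy to check:

```r
a <- 10:15
a[1]           # 10: the first element has index 1
a[length(a)]   # 15: the last element
a[-1]          # 11 12 13 14 15: a negative index drops that element
a[c(6, 1)]     # 15 10: indices can come in any order
a[2] <- 0L     # indexing also works on the left of an assignment
a              # 10 0 12 13 14 15
```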

Logical Values

  • R has built in support for logical values
  • TRUE and FALSE are built in. T (for TRUE) and F (for FALSE) are supported as shortcuts, but they are ordinary variables that can be overwritten
  • Logicals can result from a comparison using
    • <
    • >
    • <=
    • >=
    • ==
    • !=
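
A quick sketch of comparisons producing logical values, and of why T is less safe than TRUE:

```r
3 < 5         # TRUE
3 >= 5        # FALSE
2 + 2 == 4    # TRUE
"a" != "b"    # TRUE

# T is an ordinary variable, so it can be overwritten; TRUE cannot.
T <- 0
isTRUE(T)     # FALSE
rm(T)         # remove our copy; T refers to TRUE again
isTRUE(T)     # TRUE
```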

Indexing with Logicals

We can index vectors using logical values as well:

x <- c(2, 3, 5, 7)
x[c(TRUE, FALSE, FALSE, TRUE)]
## [1] 2 7
x > 3.5
## [1] FALSE FALSE  TRUE  TRUE
x[x > 3.5]
## [1] 5 7

Logical Examples

bill_length <- penguins$bill_length_mm
head(bill_length)
## [1] 39.1 39.5 40.3   NA 36.7 39.3
short_bills <- bill_length < 35
species[short_bills]
##  [1] <NA>   Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [11] <NA>  
## Levels: Adelie Chinstrap Gentoo
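
The <NA> entries appear because a comparison with a missing value is itself NA, and an NA in a logical index returns an NA element. A minimal sketch on made-up values (bills is not part of the penguins data):

```r
bills <- c(39.1, 34.1, NA, 33.5)   # hypothetical values for illustration
bills < 35                 # FALSE TRUE NA TRUE
bills[bills < 35]          # 34.1 NA 33.5: the NA comes along
bills[which(bills < 35)]   # 34.1 33.5: which() keeps only the TRUE positions
```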

Your Turn

Read in the tips data:

tips <- read.csv("https://raw.githubusercontent.com/heike/summerschool-2022/master/01-Introduction-to-R/data/tips.csv")

The tips data set consists of 244 parties being served at a restaurant.

  1. Calculate the rate that each party tipped (in percent), i.e. fill in the blanks in the statement: tips$tip_pct <- ___ / ___ * 100

  2. Find out how many people tipped over 20%.

Hint: if you use the sum function on a logical vector, it’ll return how many TRUEs are in the vector.

  3. More challenging: Calculate the sum of the total bills of anyone who tipped over 20%.
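
The hint about sum can be tried on a small vector first. The sketch below reuses the values 2, 3, 5, 7 from the logical indexing example; over_three is an illustrative name:

```r
over_three <- c(2, 3, 5, 7) > 3.5   # FALSE FALSE TRUE TRUE
sum(over_three)    # 2: TRUE counts as 1, FALSE as 0
mean(over_three)   # 0.5: the proportion of TRUEs
```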

Data Types in R

  • Use mode or class to find out information about variables
  • str is useful to find information about the structure of your data
  • There are many data types: numeric, integer, character, Date, and factor are the most common
str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
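
A sketch of class and mode on single values of the common types. The values and the factor f are made up for illustration:

```r
class(1.5)            # "numeric"
class(1L)             # "integer" (the L suffix marks an integer)
class("penguin")      # "character"
class(TRUE)           # "logical"
class(Sys.Date())     # "Date"

f <- factor(c("male", "female", "male"))
class(f)              # "factor"
mode(f)               # "numeric": factors are stored as integer codes
levels(f)             # "female" "male"
```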

Some useful functions

There is a whole variety of useful functions that operate on vectors. Two of the more common ones are length, which returns the length (number of elements) of a vector, and sum, which adds up all the elements of a vector.

x <- bill_length[1:5]
length(x)
## [1] 5
sum(x)
## [1] NA
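
The sum is NA because the fourth value is missing, and any sum that includes an unknown value is itself unknown. Many summary functions accept na.rm = TRUE to drop missing values first. A sketch with the five values typed in by hand:

```r
x <- c(39.1, 39.5, 40.3, NA, 36.7)   # the first five bill lengths shown earlier
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 155.6
is.na(x)               # FALSE FALSE FALSE TRUE FALSE
sum(is.na(x))          # 1 missing value
```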

Statistical Functions

Using the basic functions we’ve learned, it wouldn’t be hard to compute some basic statistics.

(n <- length(bill_length))
## [1] 344
(meanlength <- sum(bill_length) / n)
## [1] NA
(standdev <- sqrt(sum((bill_length - meanlength)^2) / (n - 1)))
## [1] NA

But we don’t have to.

Built-in Statistical Functions

mean(bill_length)
## [1] NA
sd(bill_length)
## [1] NA
summary(bill_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32.10   39.23   44.45   43.92   48.50   59.60       2
quantile(bill_length, c(.025, .975), na.rm = TRUE)
##   2.5%  97.5% 
## 34.810 53.085
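
The NA results come from the two missing bill lengths. As a sketch, the hand-written formulas agree with the built-in functions once missing values are removed. It uses five values typed in by hand rather than the full data; obs, mean_by_hand and sd_by_hand are illustrative names:

```r
x <- c(39.1, 39.5, 40.3, NA, 36.7)   # first five bill lengths shown earlier
obs <- x[!is.na(x)]                  # drop missing values first
n <- length(obs)                     # 4
mean_by_hand <- sum(obs) / n
sd_by_hand <- sqrt(sum((obs - mean_by_hand)^2) / (n - 1))

all.equal(mean_by_hand, mean(x, na.rm = TRUE))   # TRUE
all.equal(sd_by_hand, sd(x, na.rm = TRUE))       # TRUE
```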

Element-wise Logical Operators

  • & (elementwise AND)
  • | (elementwise OR)
c(T, T, F, F) & c(T, F, T, F)
## [1]  TRUE FALSE FALSE FALSE
c(T, T, F, F) | c(T, F, T, F)
## [1]  TRUE  TRUE  TRUE FALSE

# How many of the short billed penguins are male?
id <- (bill_length < 35 & penguins$sex == "male")
penguins[id,]
##      species    island bill_length_mm bill_depth_mm flipper_length_mm
## NA      <NA>      <NA>             NA            NA                NA
## NA.1    <NA>      <NA>             NA            NA                NA
## 15    Adelie Torgersen           34.6          21.1               198
## NA.2    <NA>      <NA>             NA            NA                NA
##      body_mass_g  sex year
## NA            NA <NA>   NA
## NA.1          NA <NA>   NA
## 15          4400 male 2007
## NA.2          NA <NA>   NA
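
The rows labelled NA appear wherever the condition itself is NA (a missing bill length or sex). One way to keep only the rows where the condition is TRUE is which(); subset() also treats NA as FALSE. A sketch on a made-up data frame (toy is not the penguins data):

```r
# Hypothetical toy data for illustration.
toy <- data.frame(
  bill = c(34.6, NA, 33.1, 40.2),
  sex  = c("male", "male", NA, "female")
)
id <- toy$bill < 35 & toy$sex == "male"
id                                       # TRUE NA NA FALSE
toy[which(id), ]                         # only row 1
subset(toy, bill < 35 & sex == "male")   # same row
sum(id, na.rm = TRUE)                    # 1: the count of matching rows
```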

Your Turn

  1. Read up on the dataset diamonds (?diamonds) in the ggplot2 package
  2. Plot price by carat (use qplot - go back to the motivating example for help with the syntax)
  3. Create a variable ppc for price per carat. Store this variable as a column in the diamonds data
  4. Make a histogram of all ppc values that exceed $10000 per carat.
  5. Explore any other interesting relationships you find