2016-06-21

R is an Overgrown Calculator

  • Follow along (copy & paste the code into the console):
curl::curl_download(
  "https://raw.githubusercontent.com/heike/rwrks/gh-pages/summerschool/01-Introduction-to-R/code/2-basics.R",
  "2-basics.R"
)
file.edit("2-basics.R")

R is an Overgrown Calculator

# Addition and Subtraction
2 + 5 - 1
## [1] 6
# Multiplication
109*23452
## [1] 2556268
# Division
3/7
## [1] 0.4285714

More Calculator Operations

# Integer division
7 %/% 2
## [1] 3
# Modulo operator (Remainder)
7 %% 2
## [1] 1
# Powers
1.5^3
## [1] 3.375
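
Integer division and the modulo operator fit together: the quotient times the divisor, plus the remainder, gives back the original number. A small sketch (the names a, b, quotient and remainder are only for illustration):

```r
a <- 17
b <- 5
quotient  <- a %/% b              # 3
remainder <- a %% b               # 2
quotient * b + remainder == a     # TRUE

# As on a calculator, ^ is evaluated before * and /, which come before + and -
2 + 3 * 2^2                       # 2 + 3 * 4 = 14
(2 + 3) * 2^2                     # 5 * 4 = 20
```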

Math Functions

  • Exponentiation
    • exp(x)
  • Logarithms
    • log(x)
    • log(x, base = 10)
  • Trigonometric functions
    • sin(x)
    • asin(x)
    • cos(x)
    • tan(x)
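
A few of these functions in action, as a quick sketch. Note that log(x) is the natural logarithm and that the trigonometric functions expect angles in radians:

```r
exp(1)                 # Euler's number e, about 2.718282
log(exp(2))            # the natural log undoes exp: 2
log(100, base = 10)    # 2
log10(100)             # shortcut for base 10: 2
sin(pi / 2)            # 1
asin(1)                # inverse sine: pi / 2, about 1.570796
cos(0)                 # 1
tan(pi / 4)            # 1, up to floating-point rounding
```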

Creating Objects

We can create an object using the assignment operator <-:

x <- 5
todays_date <- 21

We can then apply any of these functions and operations to the objects:

log(x)
## [1] 1.609438
todays_date^2
## [1] 441

Rules for Variable Creation

  • Variable names can't start with a number
  • Variables in R are case-sensitive
  • Some common letters are used internally by R and should be avoided as variable names (c, q, t, C, D, F, T, I)
  • There are reserved words that R won't let you use for variable names. (for, in, while, if, else, repeat, break, next)
  • R will let you use the name of a predefined function without any warning. Try not to overwrite those though!
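
A short sketch of these rules. The names total, Total and bill_2 are made up for illustration, and sqrt is overwritten only to show what happens:

```r
# Names are case-sensitive: these are two different objects
total <- 10
Total <- 20
total + Total            # 30

# Dots and underscores are fine inside a name; a leading digit is not.
# `2nd_bill <- 3` would be a syntax error, so put the digit later.
bill_2 <- 3

# R does not warn when you reuse the name of a predefined function
sqrt <- function(x) "oops"
sqrt(4)                  # "oops": our version now hides the built-in one
rm(sqrt)                 # delete our version
sqrt(4)                  # 2: the built-in function is visible again
```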

Vectors

A variable usually consists of more than a single value. We can create a vector using the c (combine) function:

y <- c(1, 5, 3, 2)

Operations will then be done element-wise:

y / 2
## [1] 0.5 2.5 1.5 1.0
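
Element-wise also applies when both sides are vectors of the same length. A minimal sketch with made-up vectors prices and quantity:

```r
prices   <- c(1, 5, 3, 2)
quantity <- c(2, 2, 4, 1)
prices * quantity        # 2 10 12 2: first times first, second times second, ...
prices + 10              # 11 15 13 12: the single value is reused for every element
prices^2                 # 1 25 9 4
```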

Getting Help

We will talk MUCH more about vectors in a bit, but for now, let's talk about a couple of ways to get help. The primary function to use is the help function. Just pass in the name of the function you need help with:

help(head)

The ? function also works:

?head

Googling for help is a bit hard. Searches of the form R + CRAN + <topic> usually get good results.
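
If you do not remember the exact name of a function, a few other base R tools may help. This is only a sketch of some options, not a complete list:

```r
args(sd)                 # show a function's arguments without opening its help page
apropos("quantile")      # list objects whose names contain "quantile"

# In an interactive session you can also try:
# example(mean)                        # run the examples from a help page
# help.search("standard deviation")    # search the help pages by topic
```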

R Reference Card

Your Turn

Using the R Reference Card (and the Help pages, if needed), do the following:

  1. Find out how many rows and columns the 'iris' data set has. Figure out at least 2 ways to do this. Hint: "Variable Information" section on the first page of the reference card!

  2. Use the rep function to construct the following vector: 1 1 2 2 3 3 4 4 5 5. Hint: "Data Creation" section of the reference card

  3. Use rep to construct this vector: 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Data Frames Introduction

  • tips is a data frame.
  • Data frames hold data sets
  • Columns can have different types - like an Excel spreadsheet
  • Each column in a data frame is a vector - so each column needs to have values that are all the same type.
  • We can access different columns using the $ operator.
tip <- tips$tip
bill <- tips$total_bill
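
To try this without the full data set, here is a small stand-in. toy_tips is a hypothetical name, and it holds only the first five bills and tips shown on these slides. The rate column is computed under the assumption that rate is the tip divided by the total bill, which matches the values printed later:

```r
toy_tips <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61)
)

toy_tips$tip                                        # one column, as a numeric vector
toy_tips$rate <- toy_tips$tip / toy_tips$total_bill # $ can also create a new column
toy_tips$rate[1]                                    # about 0.0594
```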

More about Vectors

A vector is an ordered sequence of values that are all the same type. We have seen that we can create them using the c or the rep function. We can also use the : operator if we wish to create consecutive values:

a <- 10:15
a
## [1] 10 11 12 13 14 15

We can extract the different elements of the vector like so:

bill[3]
## [1] 21.01

Indexing Vectors

We have seen that we can access individual elements of the vector. But indexing is a lot more powerful than that:

head(tip)
## [1] 1.01 1.66 3.50 3.31 3.61 4.71
tip[c(1, 3, 5)]
## [1] 1.01 3.50 3.61
tip[1:5]
## [1] 1.01 1.66 3.50 3.31 3.61
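
Two more indexing tricks, sketched on a copy of the first six tips (tip_head is a made-up name): negative indices drop elements, and an index can be repeated.

```r
tip_head <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
tip_head[-1]             # everything except the first element
tip_head[-(1:3)]         # drop the first three: 3.31 3.61 4.71
tip_head[c(2, 2, 6)]     # indices can repeat: 1.66 1.66 4.71
```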

Logical Values

  • R has built-in support for logical values
  • TRUE and FALSE are built in. T and F are supported as shorthand for TRUE and FALSE, but they are ordinary variables that can be overwritten, so TRUE and FALSE are the safer choice
  • Logicals can result from a comparison using
    • <
    • >
    • <=
    • >=
    • ==
    • !=
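
A quick sketch of comparisons on a vector, followed by a demonstration of why T is riskier than TRUE. The vector sizes is a made-up name holding the first six party sizes:

```r
sizes <- c(2, 3, 3, 2, 4, 4)
sizes > 2        # FALSE TRUE TRUE FALSE TRUE TRUE
sizes == 3       # FALSE TRUE TRUE FALSE FALSE FALSE
sizes != 2       # FALSE TRUE TRUE FALSE TRUE TRUE

# T is an ordinary variable, so it can be overwritten; TRUE cannot
T <- 0
isTRUE(T)        # FALSE: T no longer means TRUE
rm(T)            # remove the override so T means TRUE again
```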

Indexing with Logicals

We can index vectors using logical values as well:

x <- c(2, 3, 5, 7)
x[c(TRUE, FALSE, FALSE, TRUE)]
## [1] 2 7
x > 3.5
## [1] FALSE FALSE  TRUE  TRUE
x[x > 3.5]
## [1] 5 7

Logical Examples

rate <- tips$rate
head(rate)
## [1] 0.05944673 0.16054159 0.16658734 0.13978041 0.14680765 0.18623962
sad_tip <- rate < 0.10
rate[sad_tip]
##  [1] 0.05944673 0.07180385 0.07892660 0.05679667 0.09935739 0.05643341
##  [7] 0.09553024 0.07861635 0.07296137 0.08146640 0.09984301 0.09452888
## [13] 0.07717751 0.07398274 0.06565988 0.09560229 0.09001406 0.07745933
## [19] 0.08364236 0.06653360 0.08527132 0.08329863 0.07936508 0.03563814
## [25] 0.07358352 0.08822232 0.09820426

Your Turn

  1. Find out how many people tipped over 20%.

Hint: if you use the sum function on a logical vector, it'll return how many TRUEs are in the vector:

sum(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 3
  2. More Challenging: Calculate the sum of the total bills of everyone who tipped over 20%.

Modifying Vectors

We can modify vectors using indexing as well:

x <- bill[1:5]
x
## [1] 16.99 10.34 21.01 23.68 24.59
x[1] <- 20
x
## [1] 20.00 10.34 21.01 23.68 24.59
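
The same idea works with several positions or with a logical index. A minimal sketch on a copy of the first five bills (bills is a made-up name):

```r
bills <- c(16.99, 10.34, 21.01, 23.68, 24.59)
bills[bills > 21] <- 21          # cap every value above 21
bills                            # 16.99 10.34 21.00 21.00 21.00
bills[c(1, 2)] <- c(17, 10)      # replace two positions at once
bills                            # 17 10 21 21 21
```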

Vector Elements

Elements of a vector must all be the same type:

head(rate)
## [1] 0.05944673 0.16054159 0.16658734 0.13978041 0.14680765 0.18623962
rate[sad_tip] <- ":-("
head(rate)
## [1] ":-("               "0.160541586073501" "0.166587339362208"
## [4] "0.139780405405405" "0.146807645384303" "0.186239620403321"

Assigning a string to some of the elements converted every other value to a string as well.
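
This conversion follows a fixed order: logical values turn into numbers, and numbers turn into strings, whenever types are mixed. One way to label values without destroying the numbers is to build a separate vector, sketched here with ifelse (rates and faces are made-up names):

```r
class(c(1, 2, 3))            # "numeric"
class(c(1, TRUE, FALSE))     # "numeric": TRUE and FALSE become 1 and 0
c(1, "a", TRUE)              # "1" "a" "TRUE": everything becomes a string

rates <- c(0.0594, 0.1605, 0.1666)
faces <- ifelse(rates < 0.10, ":-(", ":-)")
faces                        # ":-(" ":-)" ":-)"
class(rates)                 # still "numeric"
```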

Data Types in R

  • Use mode or class to find out the type of a variable
  • str is useful for inspecting the structure of your data
  • There are many data types; numeric, integer, character, Date, and factor are the most common
str(tips)
## 'data.frame':    244 obs. of  8 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
##  $ rate      : num  0.0594 0.1605 0.1666 0.1398 0.1468 ...
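
A small sketch of class, mode and str on single values of each common type. The names n_people and day are made up for illustration:

```r
n_people <- 2L                   # the L suffix makes an integer
class(n_people)                  # "integer"
mode(n_people)                   # "numeric": mode is coarser than class
class(2.5)                       # "numeric"
class("Dinner")                  # "character"
class(as.Date("2016-06-21"))     # "Date"

day <- factor(c("Sun", "Sat", "Sun"))
class(day)                       # "factor"
mode(day)                        # "numeric": factors are stored as integer codes
str(day)                         # Factor w/ 2 levels "Sat","Sun": 2 1 2
```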

Converting Between Types

We can convert between different types using the as.* family of functions (as.character, as.numeric, and so on):

size <- head(tips$size)
size
## [1] 2 3 3 2 4 4
as.character(size)
## [1] "2" "3" "3" "2" "4" "4"
as.numeric("2")
## [1] 2
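
Conversions do not always do what you might expect. A sketch of three common surprises (party is a made-up factor):

```r
as.integer(3.9)                        # 3: the decimal part is dropped, not rounded
suppressWarnings(as.numeric("three"))  # NA: text that is not a number cannot be converted

# Converting a factor directly returns its internal codes, not its labels
party <- factor(c("10", "20", "10"))
as.numeric(party)                      # 1 2 1
as.numeric(as.character(party))        # 10 20 10
```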

Some useful functions

There is a whole variety of useful functions that operate on vectors. Two of the more common ones are length, which returns the length (number of elements) of a vector, and sum, which adds up all the elements of a vector.

x <- tip[1:5]
length(x)
## [1] 5
sum(x)
## [1] 13.09

Statistical Functions

Using the basic functions we've learned, it wouldn't be hard to compute some basic statistics by hand.

(n <- length(tip))
## [1] 244
(meantip <- sum(tip) / n)
## [1] 2.998279
(standdev <- sqrt(sum((tip - meantip)^2) / (n - 1)))
## [1] 1.383638

But we don't have to.

Built-in Statistical Functions

mean(tip)
## [1] 2.998279
sd(tip)
## [1] 1.383638
summary(tip)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.900   2.998   3.562  10.000
quantile(tip, c(.025, .975))
##   2.5%  97.5% 
## 1.1760 6.4625
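
As a sanity check, the hand-written formulas and the built-in functions should agree. A minimal sketch on the first five tips (tip_head, n_obs, manual_mean and manual_sd are made-up names):

```r
tip_head <- c(1.01, 1.66, 3.50, 3.31, 3.61)
n_obs <- length(tip_head)

manual_mean <- sum(tip_head) / n_obs
manual_sd   <- sqrt(sum((tip_head - manual_mean)^2) / (n_obs - 1))

all.equal(manual_mean, mean(tip_head))   # TRUE
all.equal(manual_sd, sd(tip_head))       # TRUE
median(tip_head)                         # 3.31
range(tip_head)                          # 1.01 3.61
```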

Element-wise Logical Operators

  • & (element-wise AND)
  • | (element-wise OR)
c(TRUE, TRUE, FALSE, FALSE) & c(TRUE, FALSE, TRUE, FALSE)
## [1]  TRUE FALSE FALSE FALSE
c(TRUE, TRUE, FALSE, FALSE) | c(TRUE, FALSE, TRUE, FALSE)
## [1]  TRUE  TRUE  TRUE FALSE
# Which are big bills with a poor tip rate?
id <- (bill > 40 & rate < .10)
tips[id,]
##     total_bill tip    sex smoker day   time size       rate
## 103      44.30 2.5 Female    Yes Sat Dinner    3 0.05643341
## 183      45.35 3.5   Male    Yes Sun Dinner    3 0.07717751
## 185      40.55 3.0   Male    Yes Sun Dinner    2 0.07398274
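
The same steps on a tiny example, so each intermediate result can be inspected. The values in toy_bill and toy_rate are made up for illustration and are not rows of the tips data:

```r
toy_bill <- c(15, 45, 20, 50, 42)
toy_rate <- c(0.05, 0.06, 0.17, 0.08, 0.15)

big  <- toy_bill > 40      # FALSE TRUE FALSE TRUE TRUE
poor <- toy_rate < 0.10    # TRUE TRUE FALSE TRUE FALSE

big & poor                 # FALSE TRUE FALSE TRUE FALSE: both conditions hold
big | poor                 # TRUE TRUE FALSE TRUE TRUE: at least one holds
!poor                      # FALSE FALSE TRUE FALSE TRUE: ! negates each element
toy_bill[big & poor]       # 45 50
```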

Your Turn

  1. Read up on the dataset (?diamonds)
  2. Plot price by carat (use qplot - go back to the motivating example for help with the syntax)
  3. Create a variable ppc for price/carat. Store this variable as a column in the diamonds data
  4. Make a histogram of all ppc values that exceed $10000 per carat.
  5. Explore any other interesting relationships you find