2016-06-21

Motivating Example

  • Kick things off by looking at a real data set in R
  • Goal: get the flavor of using R for data management and exploration
  • Don't worry about understanding all the coding right away
  • We will come back to see how everything works in detail

Tips Dataset

  • Tips data set recorded by a server in a restaurant over a span of about 10 weeks
  • Server recorded several variables about groups he served:
    • Amount he was tipped
    • Overall bill
    • Several characteristics about the groups being served
  • Primary Question: How do these variables influence the amount being tipped?

  • Follow along (copy & paste the code into the console):

curl::curl_download(
  "https://raw.githubusercontent.com/heike/rwrks/gh-pages/summerschool/01-Introduction-to-R/code/1-example.R",
  "1-example.R"
)
file.edit("1-example.R")

First look at data in R

Let's use R to look at the top few rows of the tips data set. First, we load the tips data with read.csv:

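# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0.0),
# which matches the str() and summary() output shown on the following slides
tips <- read.csv("http://heike.github.io/rwrks/summerschool/data/tips.csv",
                 stringsAsFactors = TRUE)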

Now, we use the head function to look at the first 6 rows of the data:

head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Tips Data Attributes

How big is this data set and what types of variables are in each column?

str(tips)
## 'data.frame':    244 obs. of  7 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
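
str() reports size and types in one go. As a minimal sketch, the same two questions can also be asked one at a time with dim() and sapply(). The data frame `demo` below is a hypothetical stand-in built from three columns of the six rows printed by head(tips), so the block runs without the download; swap in `tips` for the full data.

```r
# Hypothetical stand-in: three columns of the six rows shown by head(tips).
demo <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  size       = c(2L, 3L, 3L, 2L, 4L, 4L)
)

dim(demo)                          # number of rows, then number of columns
nrow(demo)                         # rows only
col_types <- sapply(demo, class)   # class of each column, as a named vector
col_types
```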

Tips Variables

Let's get a summary of the values for each variable in tips:

summary(tips)
##    total_bill         tip             sex      smoker      day    
##  Min.   : 3.07   Min.   : 1.000   Female: 87   No :151   Fri :19  
##  1st Qu.:13.35   1st Qu.: 2.000   Male  :157   Yes: 93   Sat :87  
##  Median :17.80   Median : 2.900                          Sun :76  
##  Mean   :19.79   Mean   : 2.998                          Thur:62  
##  3rd Qu.:24.13   3rd Qu.: 3.562                                   
##  Max.   :50.81   Max.   :10.000                                   
##      time          size     
##  Dinner:176   Min.   :1.00  
##  Lunch : 68   1st Qu.:2.00  
##               Median :2.00  
##               Mean   :2.57  
##               3rd Qu.:3.00  
##               Max.   :6.00

How does the overall amount affect the tip?

Scatterplots: tip versus total bill

First, we need to install and load ggplot2:

install.packages("ggplot2")  # only needed once per machine
library(ggplot2)
qplot(x = total_bill, y = tip, geom = "point", data = tips)
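
qplot() is a shortcut, and it is deprecated as of ggplot2 3.4.0. A sketch of the same scatterplot in the full ggplot() syntax follows, assuming ggplot2 is installed. `demo` is a hypothetical stand-in built from the six rows printed by head(tips); swap in `tips` for the full data.

```r
library(ggplot2)

# Hypothetical stand-in: two columns of the six rows shown by head(tips).
demo <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# aes() maps columns to the axes; geom_point() draws one point per row.
p <- ggplot(demo, aes(x = total_bill, y = tip)) +
  geom_point()
p
```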

There is a surprising amount of scatter!

Correlation is below 0.70

cor(tips$total_bill, tips$tip)
## [1] 0.6757341

Points and Line

Add a linear regression line to the plot

qplot(total_bill, tip, geom = "point", data = tips) + 
    geom_smooth(method = "lm")
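
geom_smooth(method = "lm") draws the least-squares line but does not print its equation. A minimal sketch of getting the intercept and slope with lm() follows, along with a check that R-squared equals the squared correlation when there is a single predictor. `demo` is a hypothetical stand-in built from the six rows printed by head(tips), so its numbers will differ from those of the full data.

```r
# Hypothetical stand-in: two columns of the six rows shown by head(tips).
demo <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# Fit tip as a linear function of total_bill.
fit <- lm(tip ~ total_bill, data = demo)
coef(fit)    # intercept, then slope (change in tip per unit of bill)

# For a one-predictor fit, R-squared and the squared correlation agree.
r <- cor(demo$total_bill, demo$tip)
r_squared <- summary(fit)$r.squared
all.equal(r^2, r_squared)
```

On the full data, the correlation of 0.676 squares to about 0.46, so the bill alone accounts for a bit less than half of the variation in tips.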

Closer Look

What do the horizontal lines mean?

More Scatterplots

Color the points by lunch and dinner groups

qplot(total_bill, tip, geom = "point", data = tips, colour = time)

Rate of Tipping

Tipping is generally done using a rule of thumb based on a percentage of the total bill. We create a new variable in the data set for the tipping rate: rate = tip / total bill.

tips$rate <- tips$tip / tips$total_bill

summary(tips$rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03564 0.12910 0.15480 0.16080 0.19150 0.71030

Tipping Rate Histogram

Let's look at the distribution of tipping rate values with a histogram

qplot(rate, data = tips, binwidth = .01)

Someone is an AMAZING tipper…

One person tipped over 70%. Who are they?

tips[which.max(tips$rate),]
##     total_bill  tip  sex smoker day   time size      rate
## 173       7.25 5.15 Male    Yes Sun Dinner    2 0.7103448
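
which.max() returns only the single largest value. To see several top tippers at once, one option is to sort with order() or filter with a logical condition. This is a sketch on a hypothetical stand-in, `demo`, built from the six rows printed by head(tips); the 0.15 cutoff is an arbitrary choice for illustration.

```r
# Hypothetical stand-in: two columns of the six rows shown by head(tips).
demo <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)
demo$rate <- demo$tip / demo$total_bill

# order() returns row positions sorted by rate; decreasing = TRUE puts the
# largest rate first.
ranked <- demo[order(demo$rate, decreasing = TRUE), ]
head(ranked, 3)    # the three highest tipping rates

# Logical indexing keeps every row whose rate exceeds the cutoff.
generous <- demo[demo$rate > 0.15, ]
generous
```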

Gender differences?

Color the points by gender of the person who pays the bill

qplot(total_bill, tip, geom = "point", data = tips, colour = sex)

Rates by Gender

Look at the average tipping rate for men and women separately

mean(tips$rate[tips$sex == "Male"])
## [1] 0.1576505
mean(tips$rate[tips$sex == "Female"])
## [1] 0.1664907
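
Calling mean() once per group works for two groups but gets repetitive with more. A sketch of computing all group means in one call with tapply() or aggregate() follows. `demo` is a hypothetical stand-in built from the six rows printed by head(tips), so its means will differ from those above.

```r
# Hypothetical stand-in: three columns of the six rows shown by head(tips).
demo <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = c("Female", "Male", "Male", "Male", "Female", "Male")
)
demo$rate <- demo$tip / demo$total_bill

# tapply() splits rate by sex and applies mean() to each piece.
by_sex <- tapply(demo$rate, demo$sex, mean)
by_sex

# aggregate() does the same through a formula and returns a data frame.
aggregate(rate ~ sex, data = demo, FUN = mean)
```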

Statistical Significance

There is a difference, but is it statistically significant?

t.test(rate ~ sex, data = tips)
## 
##  Welch Two Sample t-test
## 
## data:  rate by sex
## t = 1.1433, df = 206.76, p-value = 0.2542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006404119  0.024084498
## sample estimates:
## mean in group Female   mean in group Male 
##            0.1664907            0.1576505

The p-value of 0.25 is well above 0.05 and the confidence interval includes 0, so the difference is not statistically significant.

Multiple covariates

We can use faceting to see whether rates of tipping depend on gender and smoking

qplot(total_bill, tip, geom = "point", data = tips, facets = smoker ~ sex)

Your Turn

Try playing with chunks of code from this session to further investigate the tips data:

  1. Get a summary of the total bill values
  2. Find the average tip value for smokers
  3. Get scatterplots of tip versus total bill for different days of the week