Motivating Example

2016-06-21

Kick things off by looking at a real data set in R
Goal: get the flavor of using R for data management and exploration
Don't worry about understanding all the coding right away
We will come back to see how everything works in detail

Tips Dataset

The tips data set was recorded by a server in a restaurant over a span of about 10 weeks
The server recorded several variables about the groups he served:
- Amount he was tipped
- Overall bill
- Several characteristics about the groups being served
Primary Question: How do these variables influence the amount being tipped?
Follow along (copy & paste the code into the console):

curl::curl_download(
  "https://raw.githubusercontent.com/heike/rwrks/gh-pages/summerschool/01-Introduction-to-R/code/1-example.R",
  "1-example.R"
)
file.edit("1-example.R")

First look at data in R

Let's use R to look at the top few rows of the tips data set. First, we load the tips data with read.csv:

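# stringsAsFactors = TRUE matches the factor columns shown below (since R 4.0.0, text is read as character by default)
tips <- read.csv("http://heike.github.io/rwrks/summerschool/data/tips.csv", stringsAsFactors = TRUE)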

Now, we use the head function to look at the first 6 rows of the data:

head(tips)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4
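head() also accepts an n argument, and tail() shows the end of a data frame instead of the start. The sketch below assumes a hypothetical data frame tips6 that holds only the six rows printed above (with four of the seven columns), typed in by hand so that it runs without downloading anything. With the full data you would call the same functions on tips.

```r
# tips6 is a hypothetical stand-in: the six rows printed by head(tips) above
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = factor(c("Female", "Male", "Male", "Male", "Female", "Male")),
  size       = c(2L, 3L, 3L, 2L, 4L, 4L)
)

first3 <- head(tips6, n = 3)  # first three rows instead of the default six
last2  <- tail(tips6, n = 2)  # last two rows
print(first3)
print(last2)
```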

Tips Data Attributes

How big is this data set and what types of variables are in each column?

str(tips)

## 'data.frame':    244 obs. of  7 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
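The same information that str() prints can be pulled out piece by piece, which is handy when you need one number rather than a whole report. This is a minimal sketch on a hypothetical six-row stand-in called tips6, built from the rows shown by head(tips); on the full data, dim(tips) would report the 244 rows and 7 columns listed above.

```r
# tips6 is a hypothetical stand-in: four columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = factor(c("Female", "Male", "Male", "Male", "Female", "Male")),
  size       = c(2L, 3L, 3L, 2L, 4L, 4L)
)

size_of_data <- dim(tips6)            # number of rows, then number of columns
col_classes  <- sapply(tips6, class)  # storage type of each column
sex_levels   <- levels(tips6$sex)     # the categories of a factor column

print(size_of_data)
print(col_classes)
print(sex_levels)
```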

Tips Variables

Let's get a summary of the values for each variable in tips:

summary(tips)

##    total_bill         tip             sex      smoker      day    
##  Min.   : 3.07   Min.   : 1.000   Female: 87   No :151   Fri :19  
##  1st Qu.:13.35   1st Qu.: 2.000   Male  :157   Yes: 93   Sat :87  
##  Median :17.80   Median : 2.900                          Sun :76  
##  Mean   :19.79   Mean   : 2.998                          Thur:62  
##  3rd Qu.:24.13   3rd Qu.: 3.562                                   
##  Max.   :50.81   Max.   :10.000                                   
##      time          size     
##  Dinner:176   Min.   :1.00  
##  Lunch : 68   1st Qu.:2.00  
##               Median :2.00  
##               Mean   :2.57  
##               3rd Qu.:3.00  
##               Max.   :6.00
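For factor columns, summary() reports counts per level, for example 87 of the 244 bills were paid by women. table() gives the same counts for a single column and prop.table() turns them into proportions. The sketch below uses only the six sex values from the rows printed by head(tips), stored in a hypothetical vector sex6, so the numbers it prints describe those six rows and not the full data set.

```r
# sex6 is a hypothetical stand-in: the sex column of the six rows shown by head(tips)
sex6 <- factor(c("Female", "Male", "Male", "Male", "Female", "Male"))

counts <- table(sex6)         # how many rows fall into each level
shares <- prop.table(counts)  # the same counts as proportions that sum to 1

print(counts)
print(round(shares, 2))
```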

How does the overall amount affect the tip?

Scatterplots: tip versus total bill

First, we need to install and load ggplot2:

install.packages("ggplot2")  # only needed once per R installation
library(ggplot2)             # needed in every new R session

qplot(x = total_bill, y = tip, geom = "point", data = tips)

There is a surprising amount of scatter!
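Recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so it still works but prints a warning. The same scatterplot can be written with ggplot(), aes() and geom_point(). The sketch below assumes a hypothetical six-row stand-in, tips6, and only draws the plot if ggplot2 is installed; with the full data you would pass tips instead.

```r
# tips6 is a hypothetical stand-in: two columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# Guarded so the snippet does not fail where ggplot2 is missing
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Same plot as qplot(x = total_bill, y = tip, geom = "point", data = ...)
  p <- ggplot(tips6, aes(x = total_bill, y = tip)) + geom_point()
  print(p)
}
```

The longer form makes the pieces explicit: the data, the mapping of columns to axes, and the geometry that draws them.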

The correlation between total bill and tip is below 0.70:

cor(tips$total_bill, tips$tip)

## [1] 0.6757341
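By default cor() computes the Pearson correlation: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below checks that formula against cor() on the six bill and tip values shown by head(tips), stored in hypothetical vectors bill6 and tip6. The value it prints belongs to those six rows only and will differ from the 0.6757 of the full data.

```r
# Hypothetical stand-ins: the six bill and tip values shown by head(tips)
bill6 <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)
tip6  <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)

# Pearson correlation written out from its definition
dev_bill <- bill6 - mean(bill6)
dev_tip  <- tip6 - mean(tip6)
r_manual <- sum(dev_bill * dev_tip) / sqrt(sum(dev_bill^2) * sum(dev_tip^2))

r_builtin <- cor(bill6, tip6)  # default method is "pearson"

print(c(manual = r_manual, builtin = r_builtin))
```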

Points and Line

Add a linear regression line to the plot:

qplot(total_bill, tip, geom = "point", data = tips) + 
    geom_smooth(method = "lm")
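geom_smooth(method = "lm") fits a straight line of y on x with lm() and draws it, by default with a confidence band. To see the intercept and slope behind such a line, you can fit the model yourself. This is a minimal sketch on a hypothetical six-row stand-in, tips6, so the coefficients it prints are not those of the full data.

```r
# tips6 is a hypothetical stand-in: two columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

fit <- lm(tip ~ total_bill, data = tips6)  # regress tip on total bill
print(coef(fit))                           # intercept and slope of the line

# Predicted tip for a bill of 20 (an arbitrary illustrative value)
pred20 <- predict(fit, newdata = data.frame(total_bill = 20))
print(pred20)
```

The slope is the expected change in tip per additional dollar on the bill.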

Closer Look

What do the horizontal lines mean?
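One plausible reading, which is an assumption here rather than something the slides state, is that the horizontal bands come from tips rounded to whole or half dollar amounts. A quick way to probe this is to count how many tips have no cents at all. The sketch uses the first ten tip values printed by str(tips), stored in a hypothetical vector tip10; on the full data you would use tips$tip.

```r
# tip10 is a hypothetical stand-in: the first ten tip values printed by str(tips)
tip10 <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23)

whole_dollar <- tip10 == round(tip10)          # TRUE when the tip has no cents
half_dollar  <- (2 * tip10) == round(2 * tip10)  # TRUE for whole or half dollars

print(sum(whole_dollar))  # number of whole-dollar tips
print(sum(half_dollar))   # number of tips on a 50-cent grid
```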

More Scatterplots

Color the points by lunch and dinner groups

qplot(total_bill, tip, geom = "point", data = tips, colour = time)

Rate of Tipping

Tipping is generally done using a rule of thumb based on a percentage of the total bill. We create a new variable in the data set for the tipping rate, defined as rate = tip / total bill:

tips$rate <- tips$tip / tips$total_bill

summary(tips$rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03564 0.12910 0.15480 0.16080 0.19150 0.71030
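The division is vectorized: R divides each tip by the bill in the same row, so one line creates the whole column. The sketch below works this through on a hypothetical six-row stand-in, tips6, and also expresses the rate as a percentage, which is how tipping rules of thumb are usually quoted.

```r
# tips6 is a hypothetical stand-in: two columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

tips6$rate     <- tips6$tip / tips6$total_bill  # element-wise, row by row
tips6$rate_pct <- round(100 * tips6$rate, 1)    # same rate as a percentage

print(tips6)
```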

Tipping Rate Histogram

Let's look at the distribution of tipping rate values with a histogram:

qplot(rate, data = tips, binwidth = .01)

Someone is an AMAZING tipper…

One person tipped over 70%. Who are they?

tips[which.max(tips$rate),]

##     total_bill  tip  sex smoker day   time size      rate
## 173       7.25 5.15 Male    Yes Sun Dinner    2 0.7103448
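which.max() returns the position of the first largest value, and that position is then used as a row index. If you want the top few rows rather than a single one, order() sorts the whole data frame. A minimal sketch on a hypothetical six-row stand-in, tips6:

```r
# tips6 is a hypothetical stand-in: two columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)
tips6$rate <- tips6$tip / tips6$total_bill

best_row <- which.max(tips6$rate)  # position of the highest rate
print(tips6[best_row, ])

# Sort by rate, highest first, and keep the top three rows
top3 <- head(tips6[order(tips6$rate, decreasing = TRUE), ], n = 3)
print(top3)
```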

Gender differences?

Color the points by gender of the person who pays the bill

qplot(total_bill, tip, geom = "point", data = tips, colour = sex)

Rates by Gender

Look at the average tipping rate for men and women separately

mean(tips$rate[tips$sex == "Male"])

## [1] 0.1576505

mean(tips$rate[tips$sex == "Female"])

## [1] 0.1664907
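Computing each group mean by hand gets tedious with more groups. tapply() applies a function to one column separately for each level of another column. The sketch below uses a hypothetical six-row stand-in, tips6, so its means are not the 0.158 and 0.166 of the full data.

```r
# tips6 is a hypothetical stand-in: three columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = factor(c("Female", "Male", "Male", "Male", "Female", "Male"))
)
tips6$rate <- tips6$tip / tips6$total_bill

# Mean rate for every level of sex in one call
rate_by_sex <- tapply(tips6$rate, tips6$sex, mean)
print(rate_by_sex)
```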

Statistical Significance

There is a difference, but is it statistically significant?

t.test(rate ~ sex, data = tips)

## 
##  Welch Two Sample t-test
## 
## data:  rate by sex
## t = 1.1433, df = 206.76, p-value = 0.2542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006404119  0.024084498
## sample estimates:
## mean in group Female   mean in group Male 
##            0.1664907            0.1576505
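In the output above the p-value of 0.2542 is well above 0.05 and the 95 percent confidence interval contains 0, so the difference between the two group means is not statistically significant at the 5% level. The object returned by t.test() stores these pieces so they can be used in later code. The sketch below shows how to extract them, using a hypothetical six-row stand-in, tips6, which is far too small for a meaningful test and serves only to show the mechanics.

```r
# tips6 is a hypothetical stand-in: three columns of the six rows shown by head(tips)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = factor(c("Female", "Male", "Male", "Male", "Female", "Male"))
)
tips6$rate <- tips6$tip / tips6$total_bill

res <- t.test(rate ~ sex, data = tips6)  # Welch two-sample t-test

p_val <- res$p.value   # the p-value as a single number
ci    <- res$conf.int  # lower and upper bound of the confidence interval
means <- res$estimate  # the two group means

print(p_val)
print(ci)
print(means)
```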

Multiple covariates

We can use faceting to see whether rates of tipping depend on gender and smoking

qplot(total_bill, tip, geom = "point", data = tips, facets=smoker~sex)

Your Turn

Try playing with chunks of code from this session to further investigate the tips data:

- Get a summary of the total bill values
- Find the average tip value for smokers
- Get scatterplots of tip versus total bill for different days of the week