- Kick things off by looking at a real data set in R
- Goal: get the flavor of using R for data management and exploration
- Don't worry about understanding all the coding right away
- We will come back to see how everything works in detail
2016-06-21
Primary Question: How do these variables influence the amount being tipped?
Follow along (copy & paste the code into the console):
curl::curl_download(
  "https://raw.githubusercontent.com/heike/rwrks/gh-pages/summerschool/01-Introduction-to-R/code/1-example.R",
  "1-example.R"
)
file.edit("1-example.R")
Let's use R to look at the top few rows of the tips data set. First, we load the tips data with read.csv:
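# stringsAsFactors = TRUE reproduces the Factor columns shown in the output; R >= 4.0 defaults to FALSE
tips <- read.csv("http://heike.github.io/rwrks/summerschool/data/tips.csv", stringsAsFactors = TRUE)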
Now, we use the head function to look at the first 6 rows of the data:
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4
How big is this data set and what types of variables are in each column?
str(tips)
## 'data.frame':    244 obs. of  7 variables:
##  $ total_bill: num  17 10.3 21 23.7 24.6 ...
##  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
##  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...
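The same information that str prints can also be pulled out as values you can reuse in code. A minimal sketch, assuming a hypothetical data frame tips6 typed in from the six rows printed by head(tips), so that it runs without a download; the full data set has 244 rows, not 6.

```r
# Hypothetical stand-in: the first six rows of tips, typed in by hand
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = factor(c("Female", "Male", "Male", "Male", "Female", "Male")),
  smoker     = factor(rep("No", 6)),
  day        = factor(rep("Sun", 6)),
  time       = factor(rep("Dinner", 6)),
  size       = c(2L, 3L, 3L, 2L, 4L, 4L)
)

# Number of rows, then number of columns
dim(tips6)

# One class name per column, returned as a named character vector
col_types <- sapply(tips6, class)
col_types
```

On the real data, replacing tips6 with tips should give the row and column counts reported in the first line of the str output.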
Let's get a summary of the values for each variable in tips:
summary(tips)
##    total_bill         tip             sex      smoker      day    
##  Min.   : 3.07   Min.   : 1.000   Female: 87   No :151   Fri :19  
##  1st Qu.:13.35   1st Qu.: 2.000   Male  :157   Yes: 93   Sat :87  
##  Median :17.80   Median : 2.900                          Sun :76  
##  Mean   :19.79   Mean   : 2.998                          Thur:62  
##  3rd Qu.:24.13   3rd Qu.: 3.562                                   
##  Max.   :50.81   Max.   :10.000                                   
##      time          size     
##  Dinner:176   Min.   :1.00  
##  Lunch : 68   1st Qu.:2.00  
##               Median :2.00  
##               Mean   :2.57  
##               3rd Qu.:3.00  
##               Max.   :6.00
First, we need to install and load ggplot2:
install.packages("ggplot2")
library(ggplot2)
qplot(x = total_bill, y = tip, geom = "point", data = tips)
There is a surprising amount of scatter!
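qplot has been deprecated since ggplot2 3.4.0, so it may help to see the same scatterplot written with ggplot. This is a minimal sketch, assuming ggplot2 is installed; tips6 is a hypothetical stand-in typed in from the six rows printed by head(tips), so the block runs without a download.

```r
library(ggplot2)

# Hypothetical stand-in: first six rows of tips (only the columns plotted)
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# Map columns to the x and y axes, then add a layer of points
p <- ggplot(tips6, aes(x = total_bill, y = tip)) +
  geom_point()

print(p)
```

Swapping tips6 for the full tips data frame should draw the same kind of plot as the qplot call.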
Correlation is below 0.70
cor(tips$total_bill, tips$tip)
## [1] 0.6757341
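To see what cor computes, the Pearson correlation can be worked out by hand from deviations about the means. A minimal sketch on the six rows printed by head(tips), typed in as hypothetical stand-in vectors; the value on six rows will differ from the 0.676 obtained on all 244.

```r
# Hypothetical stand-in: first six rows of tips
total_bill <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)
tip        <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)

# Deviations of each observation from its column mean
dx <- total_bill - mean(total_bill)
dy <- tip - mean(tip)

# Pearson correlation: shared variation scaled by each variable's spread
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(total_bill, tip)

# The two should agree up to floating-point error
all.equal(r_manual, r_builtin)
```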
Add a linear regression line to the plot
qplot(total_bill, tip, geom = "point", data = tips) + geom_smooth(method = "lm")
What do the horizontal lines mean?
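The straight line that geom_smooth(method = "lm") draws is an ordinary least-squares fit of tip on total_bill, which can also be computed directly with lm. A minimal sketch on a hypothetical six-row stand-in typed in from head(tips); the coefficients on the full data will differ.

```r
# Hypothetical stand-in: first six rows of tips
tips6 <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# Least-squares fit: tip = intercept + slope * total_bill
fit <- lm(tip ~ total_bill, data = tips6)
coef(fit)

# The slope equals the covariance divided by the variance of the predictor
slope_manual <- cov(tips6$total_bill, tips6$tip) / var(tips6$total_bill)
all.equal(unname(coef(fit)["total_bill"]), slope_manual)
```

The slope can be read as the extra tip, in dollars, associated with each additional dollar on the bill.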
Color the points by lunch and dinner groups
qplot(total_bill, tip, geom = "point", data = tips, colour = time)
Tipping is generally done using a rule of thumb based on a percentage of the total bill. We create a new variable in the data set for the tipping rate, defined as tip / total bill:
tips$rate <- tips$tip / tips$total_bill
summary(tips$rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03564 0.12910 0.15480 0.16080 0.19150 0.71030
Let's look at the distribution of tipping rates with a histogram
qplot(rate, data = tips, binwidth = .01)
One person tipped over 70%. Who are they?
tips[which.max(tips$rate),]
##     total_bill  tip  sex smoker day   time size      rate
## 173       7.25 5.15 Male    Yes Sun Dinner    2 0.7103448
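which.max returns the position of the first largest value, which is why it can be used as a row index. A minimal sketch on the six rows printed by head(tips), typed in as hypothetical stand-in vectors; it also shows a variant that keeps every row tied for the maximum, in case there is more than one.

```r
# Hypothetical stand-in: first six rows of tips
total_bill <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)
tip        <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
rate       <- tip / total_bill

# Position of the first maximum (a single integer)
i <- which.max(rate)
i

# Positions of all values equal to the maximum (may be more than one)
all_max <- which(rate == max(rate))
all_max
```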
Color the points by gender of the person who pays the bill
qplot(total_bill, tip, geom = "point", data = tips, colour = sex)
Look at the average tipping rate for men and women separately
mean(tips$rate[tips$sex == "Male"])
## [1] 0.1576505
mean(tips$rate[tips$sex == "Female"])
## [1] 0.1664907
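Both group means can also be computed in one call with tapply, which splits a vector by a grouping factor and applies a function to each piece. A minimal sketch on the six rows printed by head(tips), typed in as hypothetical stand-in vectors; with only two women in these six rows, the numbers say nothing about the full data.

```r
# Hypothetical stand-in: first six rows of tips
total_bill <- c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29)
tip        <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
sex        <- factor(c("Female", "Male", "Male", "Male", "Female", "Male"))
rate       <- tip / total_bill

# Mean rate within each level of sex, returned with the level names
group_means <- tapply(rate, sex, mean)
group_means

# Same numbers as subsetting each group by hand
all.equal(group_means[["Female"]], mean(rate[sex == "Female"]))
all.equal(group_means[["Male"]],   mean(rate[sex == "Male"]))
```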
There is a difference, but is it statistically significant?
t.test(rate ~ sex, data = tips)
## 
##  Welch Two Sample t-test
## 
## data:  rate by sex
## t = 1.1433, df = 206.76, p-value = 0.2542
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006404119  0.024084498
## sample estimates:
## mean in group Female   mean in group Male 
##            0.1664907            0.1576505
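The t and df values in this output come from the Welch formulas, which do not assume equal variances in the two groups. A minimal sketch that recomputes both by hand and compares them with t.test, using a hypothetical six-row stand-in typed in from head(tips); the sample is far too small for the result to mean anything, and it is only there to show the arithmetic.

```r
# Hypothetical stand-in: first six rows of tips
d <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59, 25.29),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex        = factor(c("Female", "Male", "Male", "Male", "Female", "Male"))
)
d$rate <- d$tip / d$total_bill

res <- t.test(rate ~ sex, data = d)   # Welch test is the default

f <- d$rate[d$sex == "Female"]
m <- d$rate[d$sex == "Male"]

# Squared standard error contributed by each group
a <- var(f) / length(f)
b <- var(m) / length(m)

# Welch t statistic: difference in means over the combined standard error
t_manual  <- (mean(f) - mean(m)) / sqrt(a + b)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (a + b)^2 / (a^2 / (length(f) - 1) + b^2 / (length(m) - 1))

all.equal(unname(res$statistic), t_manual)
all.equal(unname(res$parameter), df_manual)
res$p.value   # the p-value can be extracted from the result object
```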
We can use faceting to see whether tipping rates depend on gender and smoking
qplot(total_bill, tip, geom = "point", data = tips, facets = smoker ~ sex)
Try playing with chunks of code from this session to further investigate the tips data: