do.call - loading in multiple data files

Download http://markwestcott34.github.io/economics/teaching/ss2015/R/datasets/multiple.zip and extract all the .DTA files into a new, empty folder somewhere.

setwd("/Users/mark/tresor/Economics/Teaching/R/materials/session3")
#setwd("C:\\Users\\Mark.Westcott\\Documents\\My Tresors\\Main\\Economics\\Teaching\\R\\materials\\session3")
library(haven)

filenames <- list.files(pattern = "\\.DTA$") #pattern takes a regular expression
data_list <- lapply(filenames, read_stata)

We want to do something like this, all the way up to data_list[[length(data_list)]]

rbind(data_list[[1]], data_list[[2]], data_list[[3]])

We can do exactly that using do.call - it passes each element of a list as an argument to a function, e.g.

merged <- do.call(rbind, data_list)

Great, but we’ve lost the information about which file each observation came from. The *apply functions come to the rescue again:

#Remember rep? 
rep(c(1,2,3),times=c(3,2,1))
## [1] 1 1 1 2 2 3
sapply(data_list, nrow) 
##  [1]  57  28  75 100   3  30  14   6  67  93  72  99   2  16  83  19  31
## [18]  18   8  46  88  95  56  44  42  90  15
merged$filename <- rep(filenames, times = sapply(data_list, nrow) )
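The whole pattern fits in a few lines. Here is a self-contained sketch on toy data frames (df1, df2 and the file names are invented here, standing in for the data read from disk):

```r
#Two toy data frames standing in for the .DTA files
df1 <- data.frame(x = 1:3)
df2 <- data.frame(x = 4:5)
toy_list  <- list(df1, df2)
toy_names <- c("a.DTA", "b.DTA")

#Stack them, then tag each row with the file it came from
merged <- do.call(rbind, toy_list) #same as rbind(df1, df2)
merged$filename <- rep(toy_names, times = sapply(toy_list, nrow))
```

merged now has five rows, with "a.DTA" on the first three and "b.DTA" on the last two.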

Regressions

Recommended resource: Using R for Introductory Econometrics by Florian Heiss, online at http://www.urfie.net/

OLS

Run OLS regressions in R using the built-in lm() function (the lmtest package, loaded below, adds testing functions such as coeftest()). We’ll run the following regression on a dataset of college admission results:

\[Admission_i = \beta_0 + \beta_1 GRE_i + \beta_2 GPA_i + u_i\]

Begin by loading packages and the data

#install.packages("lmtest")
library(lmtest)

#Load in the admissions data
admissions <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")

And run the linear regression:

lm(admissions$admit ~ admissions$gre + admissions$gpa)
## 
## Call:
## lm(formula = admissions$admit ~ admissions$gre + admissions$gpa)
## 
## Coefficients:
##    (Intercept)  admissions$gre  admissions$gpa  
##     -0.5279342       0.0005489       0.1542363
#Instead of typing admissions$ every time, we can use the data= argument
lm(admit ~ gre + gpa , data=admissions)
## 
## Call:
## lm(formula = admit ~ gre + gpa, data = admissions)
## 
## Coefficients:
## (Intercept)          gre          gpa  
##  -0.5279342    0.0005489    0.1542363
#Store the output. This is a list, use str() to examine it, or the summary() function to get nice output
admissions_r <- lm(admit ~ gre + gpa , data=admissions)

Get a table of coefficient t-tests by using the coeftest() function..

coeftest(admissions_r)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value Pr(>|t|)  
## (Intercept) -0.52793418  0.20872769 -2.5293  0.01182 *
## gre          0.00054890  0.00021407  2.5642  0.01071 *
## gpa          0.15423629  0.06497690  2.3737  0.01809 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(admissions_r)["gre","t value"]
## [1] 2.564175

..and confidence intervals using the confint() function..

confint(admissions_r)
##                     2.5 %        97.5 %
## (Intercept) -0.9382839225 -0.1175844277
## gre          0.0001280574  0.0009697428
## gpa          0.0264944726  0.2819781075

..and of Wald tests using the waldtest() function.

waldtest(admissions_r)
## Wald test
## 
## Model 1: admit ~ gre + gpa
## Model 2: admit ~ 1
##   Res.Df Df      F    Pr(>F)    
## 1    397                        
## 2    399 -2 9.9064 6.333e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can access fitted values and residuals:

admissions$hat <- fitted(admissions_r)
admissions$resid <- residuals(admissions_r)
plot(admissions$admit, admissions$resid)

Run regressions on a subset by passing the criteria as the subset argument:

admissions_r <- lm(admit ~ gre + gpa, subset = rank == 1, data = admissions)
#is equivalent to 
admissions_r <- lm(admit ~ gre + gpa , data=admissions[admissions$rank == 1,])

Use I() to ‘do math’ inside a formula:

admissions_r <- lm(admit ~ gre + gpa + I(gpa^2) + I(gpa^3), data=admissions)
admissions_r <- lm(admit ~ gre + gpa + I(rank == 1) , data=admissions)

Turn anything into a factor variable (the equivalent of i. in Stata):

admissions_r <- lm(admit ~ gre + gpa + factor(rank) , data=admissions)

Use “*” inside lm() to get interactions plus their component parts (## in Stata), or “:” to get only the interaction terms (# in Stata), i.e. these are equivalent:

#Load the d dataframe
load(url("http://markwestcott34.github.io/economics/teaching/ss2015/R/datasets/interactions.Rda"))

lm(y ~ x + z + x:z, data=d)
lm(y ~ x*z, data=d)

Missing values are handled sensibly by default: lm() drops incomplete observations (na.action defaults to na.omit) and summary() reports how many were deleted.
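A minimal check of that behaviour (the toy data frame here is invented for illustration):

```r
#One row has a missing x; lm() drops it rather than failing
d <- data.frame(y = c(2, 4, 6, 8, 10),
                x = c(1, NA, 3, 4, 5))
fit <- lm(y ~ x, data = d)
nobs(fit) #4: the incomplete row was silently dropped
```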

Exercise: Interaction terms

Download the file at http://markwestcott34.github.io/economics/teaching/ss2015/R/datasets/mroz.dta and store it in a dataframe called mroz using the read_stata() function from the ‘haven’ package. The file contains income and demographic data for families.

  1. Use the cut function to group the age column by categories: 30-39, 40-49, 50-59 (hint: figure out how cut(1:20, breaks = c(0,5,10,20)) works). Add the age category to the mroz data frame.

  2. Run the following regression:

\[\log(faminc_i) = \beta_0 + \beta_1 agecat_i + \beta_2 city_i + \beta_3 (city_i * agecat_i) + u_i\]

where \(city_i * agecat_i\) is the interaction between those two terms.
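To get started on the hint: cut() bins a numeric vector into a factor, and its labels argument (an extra shown here) names the categories:

```r
#cut() with labels; the breaks/labels here are illustrative, not the exercise answer
x <- cut(1:20, breaks = c(0, 5, 10, 20), labels = c("low", "mid", "high"))
table(x) #low: 5, mid: 5, high: 10
```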

Hypothesis testing

Use the linearHypothesis() function from the car package to test linear hypotheses or run joint F-tests.

#install.packages("car")
library(car) #to get the 'Duncan' dataset and the linearHypothesis function
Duncan_reg <- lm(prestige ~ income + education, data=Duncan)

linearHypothesis(Duncan_reg, "income = education")
## Linear hypothesis test
## 
## Hypothesis:
## income - education = 0
## 
## Model 1: restricted model
## Model 2: prestige ~ income + education
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     43 7518.9                           
## 2     42 7506.7  1    12.195 0.0682 0.7952
lh <- linearHypothesis(Duncan_reg, "income = education")
lh[2,"F"]
## [1] 0.06823286
linearHypothesis(Duncan_reg, "2 * income = education + income")
## Linear hypothesis test
## 
## Hypothesis:
## income - education = 0
## 
## Model 1: restricted model
## Model 2: prestige ~ income + education
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     43 7518.9                           
## 2     42 7506.7  1    12.195 0.0682 0.7952
linearHypothesis(Duncan_reg  , c("income","education"))
## Linear hypothesis test
## 
## Hypothesis:
## income = 0
## education = 0
## 
## Model 1: restricted model
## Model 2: prestige ~ income + education
## 
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     44 43688                                  
## 2     42  7507  2     36181 101.22 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can use matchCoefs() to find all coefficients whose names match a certain string:

Duncan_reg <- lm(prestige ~ type*(income + education), data=Duncan)
linearHypothesis(Duncan_reg  , matchCoefs(Duncan_reg, ":"))
## Linear hypothesis test
## 
## Hypothesis:
## typeprof:income = 0
## typewc:income = 0
## typeprof:education = 0
## typewc:education = 0
## 
## Model 1: restricted model
## Model 2: prestige ~ type * (income + education)
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     40 3798.0                           
## 2     36 3350.6  4    447.31 1.2015  0.327

Heteroskedasticity robust SEs

Not quite as easy as in Stata. You need to compute a variance-covariance matrix and pass it to coeftest() or linearHypothesis().

library("AER")
## Loading required package: sandwich
## Loading required package: survival
data(PublicSchools)
public_schools_r <- lm(Expenditure ~ Income + I(Income^2), data=PublicSchools)
summary(public_schools_r)
## 
## Call:
## lm(formula = Expenditure ~ Income + I(Income^2), data = PublicSchools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -160.709  -36.896   -4.551   37.290  109.729 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  8.329e+02  3.273e+02   2.545  0.01428 * 
## Income      -1.834e-01  8.290e-02  -2.213  0.03182 * 
## I(Income^2)  1.587e-05  5.191e-06   3.057  0.00368 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.68 on 47 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6553, Adjusted R-squared:  0.6407 
## F-statistic: 44.68 on 2 and 47 DF,  p-value: 1.345e-11
vc_homo <- vcov(public_schools_r)  #homo
vc_hc3 <- vcovHC(public_schools_r) #robust, default is HC3
vc_hc1 <- vcovHC(public_schools_r, type = "HC1") #robust, Stata

coeftest(public_schools_r, vc_homo) #same thing as coeftest(public_schools_r)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value Pr(>|t|)   
## (Intercept)  8.3291e+02  3.2729e+02  2.5449 0.014275 * 
## Income      -1.8342e-01  8.2899e-02 -2.2126 0.031820 * 
## I(Income^2)  1.5870e-05  5.1908e-06  3.0574 0.003677 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(public_schools_r, vc_hc3)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value Pr(>|t|)
## (Intercept)  8.3291e+02  1.0950e+03  0.7607   0.4507
## Income      -1.8342e-01  2.9754e-01 -0.6165   0.5406
## I(Income^2)  1.5870e-05  1.9952e-05  0.7954   0.4304
coeftest(public_schools_r, vc_hc1 )
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)  8.3291e+02  4.7537e+02  1.7521  0.08627 .
## Income      -1.8342e-01  1.2821e-01 -1.4306  0.15915  
## I(Income^2)  1.5870e-05  8.5607e-06  1.8539  0.07004 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linearHypothesis(public_schools_r, "Income = -2")
## Linear hypothesis test
## 
## Hypothesis:
## Income = - 2
## 
## Model 1: restricted model
## Model 2: Expenditure ~ Income + I(Income^2)
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 1693582                                  
## 2     47  150986  1   1542597 480.19 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linearHypothesis(public_schools_r, "Income = -2",vcov = vc_hc3)
## Linear hypothesis test
## 
## Hypothesis:
## Income = - 2
## 
## Model 1: restricted model
## Model 2: Expenditure ~ Income + I(Income^2)
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1     48                        
## 2     47  1 37.275 1.865e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Mini exercise:

Create a dataframe which shows the results of the public schools regression with two different standard error types.

It should show the results from the HC3 and the HC1 tests on top of each other, without the intercept row, and with a new column which indicates which model the estimates came from.

Clustered standard errors

Use the cluster.vcov() function from the multiwayvcov package to calculate a cluster-robust variance-covariance matrix:

#Clustered SEs
library(multiwayvcov)
data(petersen)

m1 <- lm(y ~ x, data=petersen)
vcov_firm <- cluster.vcov(m1, petersen$firmid)
vcov_year <- cluster.vcov(m1, petersen$year)
vcov_twoway <- cluster.vcov(m1, cbind(petersen$firmid, petersen$year))

coeftest(m1, vcov_twoway)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Cluster bootstrapping is really easy:

boot_firm_vcov <- cluster.boot(m1, petersen$firmid)
boot_year_vcov <- cluster.boot(m1, petersen$year, boot_type="wild")

coeftest(m1, boot_year_vcov)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.055822  0.5317    0.595    
## x           1.034833   0.061861 16.7285   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Probit and logit

Probit and logit are likewise easy:

myprobit <- glm(admit ~ gre + gpa + rank, family = binomial("probit"), 
                data = admissions)

mylogit <- glm(admit ~ gre + gpa + rank, family = binomial("logit"), 
                data = admissions)

Instrumental variables

Run instrumental variable regressions using the ‘ivreg’ command from the ‘AER’ package. The syntax is as follows:

y ~ x1 + x2 | z1 + z2 + z3

where x1 and x2 are regressors and z1, z2 and z3 are instruments. Exogenous regressors have to be included as instruments for themselves. E.g. if there is one exogenous regressor exog and one endogenous regressor endo with instrument instr:

`y ~ exog + endo | exog + instr`

Alternative syntax which can work better when you have lots of exogenous regressors:

y ~ exog + endo | . - endo + instr
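A quick sketch on simulated data, checking that the two syntaxes agree (all variable names here are invented; assumes the AER package is installed):

```r
library(AER) #provides ivreg()

set.seed(1)
n <- 200
exog  <- rnorm(n)
instr <- rnorm(n)
endo  <- 0.8 * instr + rnorm(n) #regressor to be instrumented
y     <- 1 + exog + endo + rnorm(n)
d     <- data.frame(y, exog, endo, instr)

#Spelling out the instrument list vs. using the "." shorthand
m_full <- ivreg(y ~ exog + endo | exog + instr, data = d)
m_dot  <- ivreg(y ~ exog + endo | . - endo + instr, data = d)
all.equal(coef(m_full), coef(m_dot))
```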

Stata code:

ivregress 2sls rent pcturban (hsngval = faminc reg2-reg4)

R equivalent:

library('haven')
hsng2 <- read_dta("http://www.stata-press.com/data/r11/hsng2.dta")

fiv <- ivreg(rent ~ pcturban + hsngval | pcturban + faminc + reg2 + reg3 + reg4,
             data = hsng2)

vc_hc3 <- vcovHC(fiv) 

summary(fiv, vcov = vc_hc3, diagnostics = T)
## 
## Call:
## ivreg(formula = rent ~ pcturban + hsngval | pcturban + faminc + 
##     reg2 + reg3 + reg4, data = hsng2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -84.1948 -11.6023  -0.5239   8.6583  73.6130 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.207e+02  1.811e+01   6.667 2.62e-08 ***
## pcturban    8.152e-02  5.704e-01   0.143   0.8870    
## hsngval     2.240e-03  8.791e-04   2.548   0.0142 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   4  44     13.30  3.5e-07 ***
## Wu-Hausman         1  46     15.91 0.000236 ***
## Sargan             3  NA     11.29 0.010268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.86 on 47 degrees of freedom
## Multiple R-Squared: 0.5989,  Adjusted R-squared: 0.5818 
## Wald test: 15.53 on 2 and 47 DF,  p-value: 6.621e-06

dplyr

Great resource: the Data Wrangling Cheat Sheet from https://www.rstudio.com/resources/cheatsheets/

The materials here are heavily based on https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

dplyr: Code

Load dplyr and the ‘hflights’ dataset (flights departing from two Houston airports in 2011). We call tbl_df() on hflights to convert the data to the tbl class: these are exactly the same as data frames, but print out nicely.

# load packages
#install.packages("dplyr")
library(dplyr)
#install.packages("hflights")
library(hflights)

# load and explore data
flights <- tbl_df(hflights)

filter: Keep rows matching criteria

Base R:

flights[flights$Month==1 & flights$DayofMonth==1, ]

Dplyr:

filter(flights, Month==1 & DayofMonth==1)
## Source: local data frame [552 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          1         6     728     840            AA       460
## 3  2011     1          1         6    1631    1736            AA      1121
## 4  2011     1          1         6    1756    2112            AA      1294
## 5  2011     1          1         6    1012    1347            AA      1700
## 6  2011     1          1         6    1211    1325            AA      1820
## 7  2011     1          1         6     557     906            AA      1994
## 8  2011     1          1         6    1824    2106            AS       731
## 9  2011     1          1         6     654    1124            B6       620
## 10 2011     1          1         6    1639    2110            B6       622
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
# note: you can use comma or ampersand to represent AND condition
#filter(flights, Month==1, DayofMonth==1)

# use the vertical bar | for an OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
## Source: local data frame [5,316 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
# you can also use %in% operator
filter(flights, UniqueCarrier %in% c("AA", "UA"))
## Source: local data frame [5,316 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

slice: Select rows by position

Base R:

flights[7:20, ] #note the comma: flights[7:20] (no comma) would select columns 7 to 20, not rows

Dplyr:

slice(flights, 7:20) #or filter(flights, row_number() %in% 7:20) 
## Source: local data frame [14 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          7         5    1359    1509            AA       428
## 2  2011     1          8         6    1355    1454            AA       428
## 3  2011     1          9         7    1443    1554            AA       428
## 4  2011     1         10         1    1443    1553            AA       428
## 5  2011     1         11         2    1429    1539            AA       428
## 6  2011     1         12         3    1419    1515            AA       428
## 7  2011     1         13         4    1358    1501            AA       428
## 8  2011     1         14         5    1357    1504            AA       428
## 9  2011     1         15         6    1359    1459            AA       428
## 10 2011     1         16         7    1359    1509            AA       428
## 11 2011     1         17         1    1530    1634            AA       428
## 12 2011     1         18         2    1408    1508            AA       428
## 13 2011     1         19         3    1356    1503            AA       428
## 14 2011     1         20         4    1507    1622            AA       428
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

select: pick columns by name

Base R:

flights[, c("DepTime", "ArrTime", "FlightNum")]

dplyr:

select(flights, DepTime, ArrTime, FlightNum)
## Source: local data frame [227,496 x 3]
## 
##    DepTime ArrTime FlightNum
## 1     1400    1500       428
## 2     1401    1501       428
## 3     1352    1502       428
## 4     1403    1513       428
## 5     1405    1507       428
## 6     1359    1503       428
## 7     1359    1509       428
## 8     1355    1454       428
## 9     1443    1554       428
## 10    1443    1553       428
## ..     ...     ...       ...

Use the colon operator to select a range of columns:

select(flights, Year:DayOfWeek)
## Source: local data frame [227,496 x 4]
## 
##    Year Month DayofMonth DayOfWeek
## 1  2011     1          1         6
## 2  2011     1          2         7
## 3  2011     1          3         1
## 4  2011     1          4         2
## 5  2011     1          5         3
## 6  2011     1          6         4
## 7  2011     1          7         5
## 8  2011     1          8         6
## 9  2011     1          9         7
## 10 2011     1         10         1
## ..  ...   ...        ...       ...

Various helpers are available too: contains(), ends_with(), starts_with(), one_of() and num_range():

select(flights, contains("Taxi"))
## Source: local data frame [227,496 x 2]
## 
##    TaxiIn TaxiOut
## 1       7      13
## 2       6       9
## 3       5      17
## 4       9      22
## 5       9       9
## 6       6      13
## 7      12      15
## 8       7      12
## 9       8      22
## 10      6      19
## ..    ...     ...
select(flights, ends_with("Delay"))
## Source: local data frame [227,496 x 2]
## 
##    ArrDelay DepDelay
## 1       -10        0
## 2        -9        1
## 3        -8       -8
## 4         3        3
## 5        -3        5
## 6        -7       -1
## 7        -1       -1
## 8       -16       -5
## 9        44       43
## 10       43       43
## ..      ...      ...
select(flights, starts_with("Day"))
## Source: local data frame [227,496 x 2]
## 
##    DayofMonth DayOfWeek
## 1           1         6
## 2           2         7
## 3           3         1
## 4           4         2
## 5           5         3
## 6           6         4
## 7           7         5
## 8           8         6
## 9           9         7
## 10         10         1
## ..        ...       ...
#a combination
select(flights, Year:DayOfWeek, TailNum, Year, ends_with("Time"))
## Source: local data frame [227,496 x 9]
## 
##    Year Month DayofMonth DayOfWeek TailNum DepTime ArrTime
## 1  2011     1          1         6  N576AA    1400    1500
## 2  2011     1          2         7  N557AA    1401    1501
## 3  2011     1          3         1  N541AA    1352    1502
## 4  2011     1          4         2  N403AA    1403    1513
## 5  2011     1          5         3  N492AA    1405    1507
## 6  2011     1          6         4  N262AA    1359    1503
## 7  2011     1          7         5  N493AA    1359    1509
## 8  2011     1          8         6  N477AA    1355    1454
## 9  2011     1          9         7  N476AA    1443    1554
## 10 2011     1         10         1  N504AA    1443    1553
## ..  ...   ...        ...       ...     ...     ...     ...
## Variables not shown: ActualElapsedTime (int), AirTime (int)

Use the minus sign to deselect a column (much easier than in base R):

#Use the minus to deselect a column
select(flights, -Year)
## Source: local data frame [227,496 x 20]
## 
##    Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1      1          1         6    1400    1500            AA       428
## 2      1          2         7    1401    1501            AA       428
## 3      1          3         1    1352    1502            AA       428
## 4      1          4         2    1403    1513            AA       428
## 5      1          5         3    1405    1507            AA       428
## 6      1          6         4    1359    1503            AA       428
## 7      1          7         5    1359    1509            AA       428
## 8      1          8         6    1355    1454            AA       428
## 9      1          9         7    1443    1554            AA       428
## 10     1         10         1    1443    1553            AA       428
## ..   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

Get unique values of certain columns (or combinations of columns) by using select() in combination with distinct():

distinct(select(flights,UniqueCarrier, TailNum))
## Source: local data frame [3,327 x 2]
## 
##    UniqueCarrier TailNum
## 1             AA  N576AA
## 2             AA  N557AA
## 3             AA  N541AA
## 4             AA  N403AA
## 5             AA  N492AA
## 6             AA  N262AA
## 7             AA  N493AA
## 8             AA  N477AA
## 9             AA  N476AA
## 10            AA  N504AA
## ..           ...     ...

“Chaining”

This code selects UniqueCarrier and DepDelay columns and filters for delays over 60 minutes

filter(
  select(flights, UniqueCarrier, DepDelay), 
  DepDelay > 60)

But this is quite hard to read and it’s easy to lose track of brackets. Instead you can use the ‘%>%’ operator (read it as “then”):

# chaining method
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60)
## Source: local data frame [10,242 x 2]
## 
##    UniqueCarrier DepDelay
## 1             AA       90
## 2             AA       67
## 3             AA       74
## 4             AA      125
## 5             AA       82
## 6             AA       99
## 7             AA       70
## 8             AA       61
## 9             AA       74
## 10            AS       73
## ..           ...      ...

arrange: reorder rows

Base R:

flights[order(flights$DepDelay, flights$ArrTime), 
        c("DepDelay", "ArrTime","TailNum")]

Dplyr:

flights %>%
    arrange(DepDelay, ArrTime) %>%
    select(DepDelay, ArrTime, TailNum)
## Source: local data frame [227,496 x 3]
## 
##    DepDelay ArrTime TailNum
## 1       -33    1314  N728SK
## 2       -23    2027  N648MQ
## 3       -19    1810  N11107
## 4       -19    2206  N13908
## 5       -18    1813  N134EV
## 6       -18    1936  N27610
## 7       -17     834  N368NB
## 8       -17    1006  N14943
## 9       -17    1009  N15926
## 10      -17    1156  N14604
## ..      ...     ...     ...
# use `desc` for descending
flights %>%
    arrange(desc(DepDelay), ArrTime, TailNum) %>%
    select(DepDelay, ArrTime, TailNum)
## Source: local data frame [227,496 x 3]
## 
##    DepDelay ArrTime TailNum
## 1       981     452  N69063
## 2       970     808  N473AA
## 3       931     948  N502MQ
## 4       869     124  N670UA
## 5       814    2243  N6EAMQ
## 6       803    1027  N609MQ
## 7       780     807  N74856
## 8       758    1040  N75861
## 9       730     149  N764NC
## 10      691     824  N651MQ
## ..      ...     ...     ...

rename: Rename variables

Base R:

names(flights)[names(flights) == "DepTime"] <- "DepartureTime"
names(flights)[names(flights) == "ArrTime"] <- "ArrivalTime"

Dplyr:

flights %>% 
  rename(DepartureTime = DepTime, ArrivalTime = ArrTime)
## Source: local data frame [227,496 x 21]
## 
##    Year Month DayofMonth DayOfWeek DepartureTime ArrivalTime UniqueCarrier
## 1  2011     1          1         6          1400        1500            AA
## 2  2011     1          2         7          1401        1501            AA
## 3  2011     1          3         1          1352        1502            AA
## 4  2011     1          4         2          1403        1513            AA
## 5  2011     1          5         3          1405        1507            AA
## 6  2011     1          6         4          1359        1503            AA
## 7  2011     1          7         5          1359        1509            AA
## 8  2011     1          8         6          1355        1454            AA
## 9  2011     1          9         7          1443        1554            AA
## 10 2011     1         10         1          1443        1553            AA
## ..  ...   ...        ...       ...           ...         ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

mutate: Add/replace variables

Base R:

flights$Speed <- flights$Distance / flights$AirTime*60

Dplyr:

# Prints the new variable but does not store it
flights %>%
    mutate(Speed = Distance/AirTime*60) %>%
    select(Year,Month,DayofMonth,FlightNum,Speed)
## Source: local data frame [227,496 x 5]
## 
##    Year Month DayofMonth FlightNum    Speed
## 1  2011     1          1       428 336.0000
## 2  2011     1          2       428 298.6667
## 3  2011     1          3       428 280.0000
## 4  2011     1          4       428 344.6154
## 5  2011     1          5       428 305.4545
## 6  2011     1          6       428 298.6667
## 7  2011     1          7       428 312.5581
## 8  2011     1          8       428 336.0000
## 9  2011     1          9       428 327.8049
## 10 2011     1         10       428 298.6667
## ..  ...   ...        ...       ...      ...
# store the new variable
flights <- 
  flights %>%
    mutate(Speed = Distance/AirTime*60)

#install.packages("magrittr")
library(magrittr)
#magrittr's %<>% operator pipes and assigns the result back in one step
flights %<>%
  mutate(Speed = Distance/AirTime * 60)

Of course, you can use the ifelse() function to make a new variable depend on a condition:

flights %>% 
  mutate(SpeedCensored = ifelse(Speed > 600, 700, Speed)) %>%
  select(Speed, SpeedCensored) %>%
  arrange(desc(Speed))
## Source: local data frame [227,496 x 2]
## 
##       Speed SpeedCensored
## 1  763.6364           700
## 2  670.0000           700
## 3  644.4706           700
## 4  639.5745           700
## 5  618.2927           700
## 6  608.6957           700
## 7  606.1111           700
## 8  606.1111           700
## 9  604.3165           700
## 10 604.0000           700
## ..      ...           ...

group_by: group a table

Use the group_by function to break a dataset down into groups of rows. Once a dataset is grouped:

  - arrange sorts only within groups
  - slice(x) takes the x’th element within each group
  - summarize lets you summarize a variable within each group
  - summarize_each lets you summarize multiple variables within each group
  - n() counts the number of entries in a group
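As a small illustration of n() inside summarise() (a sketch assuming dplyr is loaded; the toy data frame is invented here):

```r
library(dplyr)

toy <- data.frame(Dest  = c("ABQ", "ABQ", "AEX"),
                  Delay = c(5, 15, 0))
toy %>%
  group_by(Dest) %>%
  summarise(flights = n(), avg_delay = mean(Delay)) #one row per destination
```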

flights %>%
    group_by(Dest)
## Source: local data frame [227,496 x 22]
## Groups: Dest
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     1          1         6    1400    1500            AA       428
## 2  2011     1          2         7    1401    1501            AA       428
## 3  2011     1          3         1    1352    1502            AA       428
## 4  2011     1          4         2    1403    1513            AA       428
## 5  2011     1          5         3    1405    1507            AA       428
## 6  2011     1          6         4    1359    1503            AA       428
## 7  2011     1          7         5    1359    1509            AA       428
## 8  2011     1          8         6    1355    1454            AA       428
## 9  2011     1          9         7    1443    1554            AA       428
## 10 2011     1         10         1    1443    1553            AA       428
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int), Speed (dbl)
flights %>%
    select(Dest, Month, DayofMonth) %>%
    group_by(Dest) %>%
    arrange(Month, DayofMonth) %>%
    slice(1) 
## Source: local data frame [116 x 3]
## Groups: Dest
## 
##    Dest Month DayofMonth
## 1   ABQ     1          1
## 2   AEX     1          1
## 3   AGS     4          3
## 4   AMA     1          1
## 5   ANC     5         12
## 6   ASE     1          1
## 7   ATL     1          1
## 8   AUS     1          1
## 9   AVL     1          1
## 10  BFL     4          3
## ..  ...   ...        ...
flights %>%
    group_by(Dest) %>%
    summarise(prop_cancelled = mean(Cancelled))
## Source: local data frame [116 x 2]
## 
##    Dest prop_cancelled
## 1   ABQ    0.008890469
## 2   AEX    0.016574586
## 3   AGS    0.000000000
## 4   AMA    0.024672321
## 5   ANC    0.000000000
## 6   ASE    0.040000000
## 7   ATL    0.017879787
## 8   AUS    0.005376344
## 9   AVL    0.008571429
## 10  BFL    0.001984127
## ..  ...            ...
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE),
              sd_delay = sd(ArrDelay, na.rm=TRUE)) 
## Source: local data frame [116 x 3]
## 
##    Dest  avg_delay sd_delay
## 1   ABQ   7.226259 26.77047
## 2   AEX   5.839437 27.86831
## 3   AGS   4.000000       NA
## 4   AMA   6.840095 29.79867
## 5   ANC  26.080645 45.79836
## 6   ASE   6.794643 42.72711
## 7   ATL   8.233251 39.82225
## 8   AUS   7.448718 24.06890
## 9   AVL   9.973988 36.35018
## 10  BFL -13.198807 25.96681
## ..  ...        ...      ...
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted)
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier   Cancelled    Diverted
## 1             AA 0.018495684 0.001849568
## 2             AS 0.000000000 0.002739726
## 3             B6 0.025899281 0.005755396
## 4             CO 0.006782614 0.002627370
## 5             DL 0.015903067 0.003029156
## 6             EV 0.034482759 0.003176044
## 7             F9 0.007159905 0.000000000
## 8             FL 0.009817672 0.003272557
## 9             MQ 0.029044750 0.001936317
## 10            OO 0.013946828 0.003486707
## 11            UA 0.016409266 0.002413127
## 12            US 0.011268986 0.001469868
## 13            WN 0.015504047 0.002293629
## 14            XE 0.015495599 0.003449550
## 15            YV 0.012658228 0.000000000
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(max(., na.rm=TRUE)), matches("Delay"))
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier ArrDelay DepDelay
## 1             AA      978      970
## 2             AS      183      172
## 3             B6      335      310
## 4             CO      957      981
## 5             DL      701      730
## 6             EV      469      479
## 7             F9      277      275
## 8             FL      500      507
## 9             MQ      918      931
## 10            OO      380      360
## 11            UA      861      869
## 12            US      433      425
## 13            WN      499      548
## 14            XE      634      628
## 15            YV       72       54
meanGreaterThanZero <- function(x) {
  mean(x[x > 0], na.rm=TRUE)
}

flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(meanGreaterThanZero), matches("Delay"))
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier ArrDelay DepDelay
## 1             AA 28.49740 24.71647
## 2             AS 22.91195 20.77953
## 3             B6 45.47744 43.53937
## 4             CO 22.13374 17.92350
## 5             DL 32.12463 32.44865
## 6             EV 40.24231 49.25943
## 7             F9 18.68683 22.65188
## 8             FL 27.85693 33.37097
## 9             MQ 38.75135 37.87749
## 10            OO 24.14663 24.55133
## 11            UA 32.48067 28.76326
## 12            US 20.70235 26.48014
## 13            WN 25.27750 21.85373
## 14            XE 24.19337 26.90385
## 15            YV 18.67568 24.46667
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(number_of_flights = n()) %>%
    arrange(desc(number_of_flights))
## Source: local data frame [365 x 3]
## Groups: Month
## 
##    Month DayofMonth number_of_flights
## 1      1          3               702
## 2      1          2               678
## 3      1         20               663
## 4      1         27               663
## 5      1         13               662
## 6      1          7               661
## 7      1         14               661
## 8      1         21               661
## 9      1         28               661
## 10     1          6               660
## ..   ...        ...               ...

Grouping can sometimes be useful without summarising:

# for each destination, show the number of cancelled and not cancelled flights
flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head(15)
##      Cancelled
## Dest     0   1
##   ABQ 2787  25
##   AEX  712  12
##   AGS    1   0
##   AMA 1265  32
##   ANC  125   0
##   ASE  120   5
##   ATL 7745 141
##   AUS 4995  27
##   AVL  347   3
##   BFL  503   1
##   BHM 2697  39
##   BKG  108   2
##   BNA 3451  30
##   BOS 1724  28
##   BPT    3   0

Window functions take n inputs and return n values. They include ranking and ordering functions (like min_rank), offset functions (lead and lag), and cumulative aggregates (like cummean).

# for each month, calculate the number of flights and the change from the previous month
flights %>%
    group_by(Month) %>%
    summarise(flight_count = n()) %>%
    mutate(change_to_previous_month = flight_count - lag(flight_count))
## Source: local data frame [12 x 3]
## 
##    Month flight_count change_to_previous_month
## 1      1        18910                       NA
## 2      2        17128                    -1782
## 3      3        19470                     2342
## 4      4        18593                     -877
## 5      5        19172                      579
## 6      6        19600                      428
## 7      7        20548                      948
## 8      8        20176                     -372
## 9      9        18065                    -2111
## 10    10        18696                      631
## 11    11        18021                     -675
## 12    12        19117                     1096
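The other window functions work the same way inside mutate. As a minimal sketch (reusing the flight-count summary just computed, so the column names are the same as above), min_rank ranks days within each month and cummean keeps a running average:

```r
library(dplyr)

# rank days within each month by flight count, plus a running mean of the count
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    mutate(rank_in_month = min_rank(desc(flight_count)),
           running_mean  = cummean(flight_count))
```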

Other useful convenience functions

# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
## Source: local data frame [5 x 22]
## 
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1         20         4     724    1029            XE      3032
## 2 2011     1         18         2    1544    1724            F9       376
## 3 2011     1         15         6    1926    2033            XE      2570
## 4 2011     3         17         4    1731    1830            XE      2675
## 5 2011     6         29         3    1025    1354            AA      1700
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int), Speed (dbl)
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)
## Source: local data frame [56,874 x 22]
## 
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1  2011     6         29         3    1632    1723            WN        42
## 2  2011     3         27         7    1538    1749            MQ      3786
## 3  2011    10         28         5    1320    1413            WN        23
## 4  2011     9         23         5    1508    1846            CO      1133
## 5  2011    10          6         4    1749    1832            XE      4436
## 6  2011     7         25         1    2126    2307            XE      2493
## 7  2011     4         23         6    1033    1405            CO      1690
## 8  2011    11         18         5    1808    1909            CO      1603
## 9  2011     6         16         4     931    1049            XE      2079
## 10 2011     2          4         5      NA      NA            CO       546
## ..  ...   ...        ...       ...     ...     ...           ...       ...
## Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
##   (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
##   Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int), Speed (dbl)

dplyr exercises

Exercise 1

Create a pretty-printing copy of the ‘ChickWeight’ dataframe using chicks <- tbl_df(ChickWeight). ChickWeight is a panel of weights of chickens in various time periods.

  • Use filter to extract a dataframe of chick weights in time period 21

  • Use rename to rename the ‘weight’ column of chicks to ‘Weight’ to make things consistent

  • Use mutate to create a new ‘Height’ column which is calculated as \(Height_i = Weight_i / 3 + N(2, 0.5)\). Make doubly sure that you have really sampled from the normal distribution separately for each row!

  • Use group_by and summarize to calculate the mean weight for each chick. Now use arrange to find the chicks with the highest/lowest mean weights.

  • Create a dataframe called ‘summary’ which summarises the chicks in each time period. It should have the following information: number of observations, mean weight, mean height, standard deviation of weight, standard deviation of height.

  • Notice how the dataset has fewer chicks in the later periods than in the earlier ones – this is because of deaths. Using the ‘lag’ function, create a Deaths column to show how many chicks die in each period.

Exercise 2

At what age do baseball players reach their home run peak? Get a dataframe of statistics for baseball players in various years by doing the following:

#install.packages("Lahman")
library(Lahman)
players <- tbl_df(merge(Batting,Master,by= "playerID"))

Each player is uniquely identified by a playerID – your task is to find out how old each player is in the first year that they score their highest ever number of home runs (stored in the ‘HR’ column).

dplyr joins

Joins (merge in Stata lingo) are useful when data about an observation is spread across multiple tables. There are a few types, which correspond to the different keep options in Stata.

We’ll work with this data:

library(readr)
superheroes <-
  read_csv(
    paste(c("name,alignment,gender,publisher",
      "Magneto,bad,male,Marvel",
      "Storm,good,female,Marvel",
      "Mystique,bad,female,Marvel",
      "Batman,good,male,DC",
      "Joker,bad,male,DC",
      "Catwoman,bad,female,DC",
      "Hellboy,good,male,Dark Horse Comics"), collapse="\n")
  )

publishers <-
  read_csv(
    paste(c("publisher,year_founded",
    "DC,1934",
    "Marvel,1939",
    "Image,1992"), collapse= "\n")
  )

inner_join(x, y): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned

  inner_join(superheroes, publishers) #we lose Hellboy, "Dark Horse Comics" doesn't exist in publishers
## Joining by: "publisher"
## Source: local data frame [6 x 5]
## 
##       name alignment gender publisher year_founded
## 1  Magneto       bad   male    Marvel         1939
## 2    Storm      good female    Marvel         1939
## 3 Mystique       bad female    Marvel         1939
## 4   Batman      good   male        DC         1934
## 5    Joker       bad   male        DC         1934
## 6 Catwoman       bad female        DC         1934

Note that the join automatically happens on all columns that the tables have in common.
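If you prefer to be explicit (or want to guard against accidental shared column names), you can name the join column yourself; this sketch is equivalent to the automatic join above:

```r
# same result as inner_join(superheroes, publishers), but explicit
inner_join(superheroes, publishers, by = "publisher")
```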

left_join(x, y): return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned

left_join(superheroes, publishers)
## Joining by: "publisher"
## Source: local data frame [7 x 5]
## 
##       name alignment gender         publisher year_founded
## 1  Magneto       bad   male            Marvel         1939
## 2    Storm      good female            Marvel         1939
## 3 Mystique       bad female            Marvel         1939
## 4   Batman      good   male                DC         1934
## 5    Joker       bad   male                DC         1934
## 6 Catwoman       bad female                DC         1934
## 7  Hellboy      good   male Dark Horse Comics           NA
left_join(publishers, superheroes)
## Joining by: "publisher"
## Source: local data frame [7 x 5]
## 
##   publisher year_founded     name alignment gender
## 1        DC         1934   Batman      good   male
## 2        DC         1934    Joker       bad   male
## 3        DC         1934 Catwoman       bad female
## 4    Marvel         1939  Magneto       bad   male
## 5    Marvel         1939    Storm      good female
## 6    Marvel         1939 Mystique       bad female
## 7     Image         1992       NA        NA     NA

full_join(x, y): return all rows from x and y, and all columns from x and y.

full_join(publishers, superheroes)
## Joining by: "publisher"
## Source: local data frame [8 x 5]
## 
##           publisher year_founded     name alignment gender
## 1                DC         1934   Batman      good   male
## 2                DC         1934    Joker       bad   male
## 3                DC         1934 Catwoman       bad female
## 4            Marvel         1939  Magneto       bad   male
## 5            Marvel         1939    Storm      good female
## 6            Marvel         1939 Mystique       bad female
## 7             Image         1992       NA        NA     NA
## 8 Dark Horse Comics           NA  Hellboy      good   male

You can control column matches using the ‘by’ argument:

rm(flights)

#install.packages("nycflights13")
library(nycflights13)
nyc_flights <- 
  flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

left_join(nyc_flights, planes)
## Joining by: c("year", "tailnum")
## Source: local data frame [336,776 x 15]
## 
##    year month day hour origin dest tailnum carrier type manufacturer model
## 1  2013     1   1    5    EWR  IAH  N14228      UA   NA           NA    NA
## 2  2013     1   1    5    LGA  IAH  N24211      UA   NA           NA    NA
## 3  2013     1   1    5    JFK  MIA  N619AA      AA   NA           NA    NA
## 4  2013     1   1    5    JFK  BQN  N804JB      B6   NA           NA    NA
## 5  2013     1   1    5    LGA  ATL  N668DN      DL   NA           NA    NA
## 6  2013     1   1    5    EWR  ORD  N39463      UA   NA           NA    NA
## 7  2013     1   1    5    EWR  FLL  N516JB      B6   NA           NA    NA
## 8  2013     1   1    5    LGA  IAD  N829AS      EV   NA           NA    NA
## 9  2013     1   1    5    JFK  MCO  N593JB      B6   NA           NA    NA
## 10 2013     1   1    5    LGA  ORD  N3ALAA      AA   NA           NA    NA
## ..  ...   ... ...  ...    ...  ...     ...     ...  ...          ...   ...
## Variables not shown: engines (int), seats (int), speed (int), engine (chr)
left_join(nyc_flights, planes, by="tailnum")
## Source: local data frame [336,776 x 16]
## 
##    year.x month day hour origin dest tailnum carrier year.y
## 1    2013     1   1    5    EWR  IAH  N14228      UA   1999
## 2    2013     1   1    5    LGA  IAH  N24211      UA   1998
## 3    2013     1   1    5    JFK  MIA  N619AA      AA   1990
## 4    2013     1   1    5    JFK  BQN  N804JB      B6   2012
## 5    2013     1   1    5    LGA  ATL  N668DN      DL   1991
## 6    2013     1   1    5    EWR  ORD  N39463      UA   2012
## 7    2013     1   1    5    EWR  FLL  N516JB      B6   2000
## 8    2013     1   1    5    LGA  IAD  N829AS      EV   1998
## 9    2013     1   1    5    JFK  MCO  N593JB      B6   2004
## 10   2013     1   1    5    LGA  ORD  N3ALAA      AA     NA
## ..    ...   ... ...  ...    ...  ...     ...     ...    ...
## Variables not shown: type (chr), manufacturer (chr), model (chr), engines
##   (int), seats (int), speed (int), engine (chr)
left_join(nyc_flights, airports)
## Error: No common variables. Please specify `by` param.
left_join(nyc_flights, airports, c("dest" = "faa"))
## Source: local data frame [336,776 x 14]
## 
##    year month day hour origin dest tailnum carrier
## 1  2013     1   1    5    EWR  IAH  N14228      UA
## 2  2013     1   1    5    LGA  IAH  N24211      UA
## 3  2013     1   1    5    JFK  MIA  N619AA      AA
## 4  2013     1   1    5    JFK  BQN  N804JB      B6
## 5  2013     1   1    5    LGA  ATL  N668DN      DL
## 6  2013     1   1    5    EWR  ORD  N39463      UA
## 7  2013     1   1    5    EWR  FLL  N516JB      B6
## 8  2013     1   1    5    LGA  IAD  N829AS      EV
## 9  2013     1   1    5    JFK  MCO  N593JB      B6
## 10 2013     1   1    5    LGA  ORD  N3ALAA      AA
## ..  ...   ... ...  ...    ...  ...     ...     ...
## Variables not shown: name (chr), lat (dbl), lon (dbl), alt (int), tz
##   (dbl), dst (chr)
left_join(nyc_flights, airports, c("origin" = "faa"))
## Source: local data frame [336,776 x 14]
## 
##    year month day hour origin dest tailnum carrier                name
## 1  2013     1   1    5    EWR  IAH  N14228      UA Newark Liberty Intl
## 2  2013     1   1    5    LGA  IAH  N24211      UA          La Guardia
## 3  2013     1   1    5    JFK  MIA  N619AA      AA John F Kennedy Intl
## 4  2013     1   1    5    JFK  BQN  N804JB      B6 John F Kennedy Intl
## 5  2013     1   1    5    LGA  ATL  N668DN      DL          La Guardia
## 6  2013     1   1    5    EWR  ORD  N39463      UA Newark Liberty Intl
## 7  2013     1   1    5    EWR  FLL  N516JB      B6 Newark Liberty Intl
## 8  2013     1   1    5    LGA  IAD  N829AS      EV          La Guardia
## 9  2013     1   1    5    JFK  MCO  N593JB      B6 John F Kennedy Intl
## 10 2013     1   1    5    LGA  ORD  N3ALAA      AA          La Guardia
## ..  ...   ... ...  ...    ...  ...     ...     ...                 ...
## Variables not shown: lat (dbl), lon (dbl), alt (int), tz (dbl), dst (chr)

Tidying (reshaping) data with the tidyr package

Tidy data is data where:

  1. each variable forms a column
  2. each observation forms a row
  3. each type of observational unit forms a table

Here’s some data on patients’ heart rates under two different drug regimes. It’s shown three ways, the first two ‘messy’ and the third ‘tidy’.

messy
##      name  a  b
## 1  Wilbur 67 56
## 2 Petunia 80 90
## 3 Gregory 64 50
messy2
##   Drug Wilbur Petunia Gregory
## 1    a     67      80      64
## 2    b     56      90      50
tidy
##      name drug heartrate
## 1  Wilbur    a        67
## 2 Petunia    a        80
## 3 Gregory    a        64
## 4  Wilbur    b        56
## 5 Petunia    b        90
## 6 Gregory    b        50
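As a quick sketch of where we’re heading: tidyr’s gather turns the first messy form into the tidy one shown above:

```r
library(tidyr)

# collapse the drug columns 'a' and 'b' into key/value pairs
gather(messy, key = drug, value = heartrate, a:b)
```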

Three common problems with data:

  1. Column headers are values, not variable names
  2. Multiple variables in one column
  3. Variables in both rows and columns

The ‘tidyr’ package provides functions which can be used (sometimes in combination) to solve these problems.

So we can solve the problems as follows:

  1. Column headers are values, not variable names -> gather()
  2. Multiple variables in one column -> separate()
  3. Variables in both rows and columns -> gather(), separate() and then spread()

Gather:

library(dplyr)
library(tidyr)

a <- data.frame(
    subjects =  c("John", "Jane", "Jill"), 
    before =  c(3,9,5), 
    after  =  c(4,10,12)
  )

a
##   subjects before after
## 1     John      3     4
## 2     Jane      9    10
## 3     Jill      5    12
gather(a, key = period, value = income, before:after) #use dplyr's column selection syntax
##   subjects period income
## 1     John before      3
## 2     Jane before      9
## 3     Jill before      5
## 4     John  after      4
## 5     Jane  after     10
## 6     Jill  after     12

Another example:

pew <- tbl_df(
  read_csv("http://markwestcott34.github.io/economics/teaching/ss2015/R/datasets/pew.csv")
)

pew
## Source: local data frame [18 x 11]
## 
##                     religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
## 1                   Agnostic    27      34      60      81      76     137
## 2                    Atheist    12      27      37      52      35      70
## 3                   Buddhist    27      21      30      34      33      58
## 4                   Catholic   418     617     732     670     638    1116
## 5  Don<U+0092>t know/refused    15      14      15      11      10      35
## 6           Evangelical Prot   575     869    1064     982     881    1486
## 7                      Hindu     1       9       7       9      11      34
## 8    Historically Black Prot   228     244     236     238     197     223
## 9          Jehovah's Witness    20      27      24      24      21      30
## 10                    Jewish    19      19      25      25      30      95
## 11             Mainline Prot   289     495     619     655     651    1107
## 12                    Mormon    29      40      48      51      56     112
## 13                    Muslim     6       7       9      10       9      23
## 14                  Orthodox    13      17      23      32      32      47
## 15           Other Christian     9       7      11      13      13      14
## 16              Other Faiths    20      33      40      46      49      63
## 17     Other World Religions     5       2       3       4       2       7
## 18              Unaffiliated   217     299     374     365     341     528
## Variables not shown: $75-100k (int), $100-150k (int), >150k (int), Don't
##   know/refused (int)
gather(data = pew, key = income_range, value = count, -religion)
## Source: local data frame [180 x 3]
## 
##                     religion income_range count
## 1                   Agnostic        <$10k    27
## 2                    Atheist        <$10k    12
## 3                   Buddhist        <$10k    27
## 4                   Catholic        <$10k   418
## 5  Don<U+0092>t know/refused        <$10k    15
## 6           Evangelical Prot        <$10k   575
## 7                      Hindu        <$10k     1
## 8    Historically Black Prot        <$10k   228
## 9          Jehovah's Witness        <$10k    20
## 10                    Jewish        <$10k    19
## ..                       ...          ...   ...

Separate:

a <- data.frame(
     SessionRoom = c(paste("Morning", 1:3,sep=":"), paste("Evening", 1:3,sep=":")),
     outcome = sample(6)
)

a
##   SessionRoom outcome
## 1   Morning:1       2
## 2   Morning:2       5
## 3   Morning:3       4
## 4   Evening:1       1
## 5   Evening:2       3
## 6   Evening:3       6
separate(data = a, col = SessionRoom, into = c("Time","Room"), sep=":")
##      Time Room outcome
## 1 Morning    1       2
## 2 Morning    2       5
## 3 Morning    3       4
## 4 Evening    1       1
## 5 Evening    2       3
## 6 Evening    3       6

Spread:

 results <- data.frame(
     Individual = 1:5,
     Treatment = rep(c("A", "B","C","D"), each = 5),
     Payoff = sample(20)
   )

results
##    Individual Treatment Payoff
## 1           1         A     14
## 2           2         A      7
## 3           3         A     19
## 4           4         A     16
## 5           5         A     15
## 6           1         B     18
## 7           2         B     17
## 8           3         B     13
## 9           4         B      4
## 10          5         B     11
## 11          1         C     12
## 12          2         C      6
## 13          3         C      8
## 14          4         C      5
## 15          5         C      9
## 16          1         D      3
## 17          2         D     20
## 18          3         D      1
## 19          4         D      2
## 20          5         D     10
spread(data = results, key = Treatment,value = Payoff)
##   Individual  A  B  C  D
## 1          1 14 18 12  3
## 2          2  7 17  6 20
## 3          3 19 13  8  1
## 4          4 16  4  5  2
## 5          5 15 11  9 10
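spread and gather are (roughly) inverses: gathering the spread result recovers the original long format, row order aside. A minimal sketch:

```r
library(dplyr)
library(tidyr)

# wide again, then back to long
spread(results, key = Treatment, value = Payoff) %>%
  gather(key = Treatment, value = Payoff, A:D)
```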

Tidyr exercises

Exercise 1

Load a dataset of chart songs, the first time they entered the US top 100, and their rank in each week.

library(readr)
billboard <- read_csv("https://raw.githubusercontent.com/hadley/tidyr/master/vignettes/billboard.csv")
billboard
## Source: local data frame [317 x 81]
## 
##    year         artist                   track time date.entered wk1 wk2
## 1  2000          2 Pac Baby Don't Cry (Keep... 4:22   2000-02-26  87  82
## 2  2000        2Ge+her The Hardest Part Of ... 3:15   2000-09-02  91  87
## 3  2000   3 Doors Down              Kryptonite 3:53   2000-04-08  81  70
## 4  2000   3 Doors Down                   Loser 4:24   2000-10-21  76  76
## 5  2000       504 Boyz           Wobble Wobble 3:35   2000-04-15  57  34
## 6  2000           98^0 Give Me Just One Nig... 3:24   2000-08-19  51  39
## 7  2000        A*Teens           Dancing Queen 3:44   2000-07-08  97  97
## 8  2000        Aaliyah           I Don't Wanna 4:15   2000-01-29  84  62
## 9  2000        Aaliyah               Try Again 4:03   2000-03-18  59  53
## 10 2000 Adams, Yolanda           Open My Heart 5:30   2000-08-26  76  76
## ..  ...            ...                     ...  ...          ... ... ...
## Variables not shown: wk3 (int), wk4 (int), wk5 (int), wk6 (int), wk7
##   (int), wk8 (int), wk9 (int), wk10 (int), wk11 (int), wk12 (int), wk13
##   (int), wk14 (int), wk15 (int), wk16 (int), wk17 (int), wk18 (int), wk19
##   (int), wk20 (int), wk21 (int), wk22 (int), wk23 (int), wk24 (int), wk25
##   (int), wk26 (int), wk27 (int), wk28 (int), wk29 (int), wk30 (int), wk31
##   (int), wk32 (int), wk33 (int), wk34 (int), wk35 (int), wk36 (int), wk37
##   (int), wk38 (int), wk39 (int), wk40 (int), wk41 (int), wk42 (int), wk43
##   (int), wk44 (int), wk45 (int), wk46 (int), wk47 (int), wk48 (int), wk49
##   (int), wk50 (int), wk51 (int), wk52 (int), wk53 (int), wk54 (int), wk55
##   (int), wk56 (int), wk57 (int), wk58 (int), wk59 (int), wk60 (int), wk61
##   (int), wk62 (int), wk63 (int), wk64 (int), wk65 (int), wk66 (chr), wk67
##   (chr), wk68 (chr), wk69 (chr), wk70 (chr), wk71 (chr), wk72 (chr), wk73
##   (chr), wk74 (chr), wk75 (chr), wk76 (chr)

Why is the data on chart rankings not tidy? Tidy it! Then use the extract_numeric() function to make the week column numeric. Use dplyr to drop any observations where the song is not in the charts (rank = NA) and to sort by artist, then track, then week.

Exercise 2

How does this code work?

library(dplyr)
#create a nasty dataset
messy <- data.frame(
        cbind(familyid = (1:3), 
              name_father = c("Bill","Art","Paul"), 
              income_father = c(30000,22000,25000),
              name_mother = c("Bess","Amy","Pat"),
              income_mother=c(15000,18000,50000)
              ),
        stringsAsFactors = F
     )

gathered  <- gather(messy, key, value, name_father:income_mother)
separated <- separate(gathered, key, c("what","who"), sep="_")
tidied    <- spread(separated, what, value)

broom package: Code

The broom package provides a convenient way of getting regression output into a data frame:

#install.packages("broom")
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)
tidy(lmfit) #returns a data frame
##          term  estimate std.error statistic      p.value
## 1 (Intercept) 37.285126  1.877627 19.857575 8.241799e-19
## 2          wt -5.344472  0.559101 -9.559044 1.293959e-10
#could now add a column:
tidy(lmfit) %>%
     mutate(model = "OLS")
##          term  estimate std.error statistic      p.value model
## 1 (Intercept) 37.285126  1.877627 19.857575 8.241799e-19   OLS
## 2          wt -5.344472  0.559101 -9.559044 1.293959e-10   OLS
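broom also offers glance() for one-row model-level summaries and augment() for observation-level results; a quick sketch on the same model:

```r
glance(lmfit)   # one row: r.squared, sigma, statistic, p.value, ...
augment(lmfit)  # one row per observation: fitted values, residuals, ...
```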

broom Package: Exercise

What does this code do?

library(dplyr)
library(broom)

doOLS <- function(i) {
                x <- rnorm(50)
                y <- 0.6 + 1*x + rnorm(50)
                tidy(lm(y ~ x)) %>%
                  mutate(i = i)
             }
             
hundred_OLS <- lapply(1:100, doOLS)
results <- do.call(rbind,hundred_OLS)

Exercise: Tying it all together

In this exercise you will use much of what we have covered so far in order to load and merge a messy data structure.

Begin by downloading ‘fuel-economy.zip’ from http://markwestcott34.github.io/economics/teaching/ss2015/R/datasets/fuel-economy.zip and extract it somewhere on your PC.

It contains fuel efficiency data for a number of cars from the USA’s Environmental Protection Agency for the years 1978-1983. Data for each year is in a separate CSV file. Data on vehicle manufacturers is stored in a separate CSV file, which we want to merge in. There are further details on some vehicles in another set of CSV files, which we’ll also merge in.

Files:

Some information:

Your task is to merge this data into one nice data set. Namely:

  1. Read in all the *CG.csv files and join them together. Add a ‘year’ column which tells you which file each observation came from. (Make sure it’s numeric!)

  2. Unfortunately, the column names are missing in all the CSV files, but you can find them listed in the ‘abbr’ column of the “colnames78-83.csv” file. Add them in.

  3. The carline.code column is numeric, so we’ve lost the leading zeros; remember, carline codes are always three characters long. Convert the column to strings (as.character()) and use str_pad from the stringr package to add leading zeros.

  4. Extract the manufacturer code (first three digits) using the str_sub function from the stringr package.

  5. Now merge in the manufacturer names from mfgcodes.csv. Two problems with that file you’ll want to solve before the merge: there are junk columns “V14” and “V1”, and the manufacturer codes have spaces before/after them. Strip that whitespace using an appropriate function in the stringr package: find its documentation online or with help(package = stringr).

  6. Now we want to merge in the data stored in the *CARLIN.csv files. This contains further information on each carline (for example, how big the interior is for twodoor/fourdoor/hatchback body types). Read all these files in to one data frame. Once again, column names are missing. They should be as follows: “manufacturer”, “class”,“f1”,“f2”,“f3”, “twodoor.p”, “twodoor.l”, “fourdoor.p”, “fourdoor.l”, “hatch.l”, “model”, “codesymm”.

  7. Get rid of the following columns from the merged carline data; they contain misleading information: manufacturer, class, f1, f2, f3, model

  8. We are nearly ready to merge the cars data with all the carline data. But to do so, we need a common ID:
    1. Take the first 7 digits of the codesymm from the carline data
    2. For the car data, combine a two digit year code with the carline code. So if the carline code was “040857” and the data came from the 1978 file, it would be “78040857”
  9. Do the merge!
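To get started, here is a small illustration of the stringr helpers mentioned above (the example values are made up, not taken from the fuel-economy files):

```r
library(stringr)

str_pad("7", width = 3, pad = "0")  # pads on the left by default: "007"
str_sub("040857", 1, 3)             # first three characters: "040"
str_trim("  404  ")                 # strips surrounding whitespace: "404"
```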