Chapter 12 Bottlenecks in Everyday Data Analytics Tasks

This chapter presents three examples of how the lessons from the previous chapters can be applied to everyday data analytics tasks. The first section takes a statistics perspective: compute something in a different way (with a different algorithm) but arrive at essentially the same result. It also illustrates how diverse the solutions for working with large data in applied econometrics already are in the R universe, and why it often makes sense to look for a more efficient algorithm or statistical procedure before reaching for specialized packages such as bigmemory or scaling up in the cloud. The second section is a reminder (in an extremely simple setting) of how we can use R more efficiently by taking basic characteristics of the language into account. It illustrates in detail how adopting a few simple coding habits in basic R can substantially improve the efficiency of your code for larger workloads. Finally, the third section revisits the topic of scaling up, both locally and in the cloud.

12.1 Case study: Efficient fixed effects estimation

In this case study we look at a very common computational problem in applied econometrics: the estimation of a fixed effects model with many fixed-effects levels (i.e., many intercepts). The aim of this case study is to illustrate how a specific statistical procedure can substantially reduce the computational burden (here, by reducing the number of columns in the model matrix and thereby the burden of inverting a huge matrix). The context of this tutorial builds on a study called “Friends in High Places” by Cohen and Malloy (2014). Cohen and Malloy show that US senators who are alumni of the same university/college tend to help each other out in votes on industrial policies if the corresponding policy is highly relevant for the state of one senator but not for the state of the other. The data is provided along with the published article and can be accessed here: http://doi.org/10.3886/E114873V1. The data (and code) is provided in Stata format. We can import the main dataset with the foreign package (R Core Team 2022). For data handling we load the data.table package, and for hypothesis tests we load the lmtest package (Zeileis and Hothorn 2002).

# SET UP ------------------
# load packages
library(foreign)
library(data.table)
library(lmtest)
# fix vars
DATA_PATH <- "data/data_for_tables.dta"

# import data
cm <- as.data.table(read.dta(DATA_PATH))
# keep only clean obs
cm <- cm[!(is.na(yes)
           |is.na(pctsumyessameparty)
           |is.na(pctsumyessameschool)
           |is.na(pctsumyessamestate))] 

As part of this case study, we will replicate parts of Table 3 of the main article (p. 73). Specifically, we will estimate specifications (1) and (2). In both specifications, the dependent variable is an indicator yes that is equal to 1 if the corresponding senator voted Yes on the given bill and 0 otherwise. The main explanatory variables of interest are pctsumyessameschool (the percentage of senators from the same school as the corresponding senator who voted Yes on the given bill), pctsumyessamestate (the percentage of senators from the same state as the corresponding senator who voted Yes on the given bill), and pctsumyessameparty (the percentage of senators from the same party as the corresponding senator who voted Yes on the given bill). Specification 1 accounts for congress (time) fixed effects and senator (individual) fixed effects, and specification 2 accounts for congress-session-vote fixed effects and senator fixed effects.

First, let us look at a very simple example to highlight where the computational burden in the estimation of such specifications comes from. In terms of the regression model, the fixed-effects specification means that we introduce an indicator variable (an intercept) for \(N-1\) senators and \(M-1\) congresses. That is, the simple model matrix (\(X\)) without accounting for fixed effects has dimensions \(425653\times4\).

# pooled model (no FE)
model0 <- yes ~ 
  pctsumyessameschool + 
  pctsumyessamestate + 
  pctsumyessameparty 

dim(model.matrix(model0, data=cm))
## [1] 425653      4

In contrast, the model matrix of specification (1) is of dimensions \(425653\times168\), and the model matrix of specification (2) even of \(425653\times6929\).

model1 <- 
  yes ~ pctsumyessameschool + 
        pctsumyessamestate + 
        pctsumyessameparty + 
        factor(congress) +
        factor(id) -1
mm1 <- model.matrix(model1, data=cm)
dim(mm1)
## [1] 425653    168

Using OLS to estimate such a model thus involves the computation of a very large matrix inversion (because \(\hat{\beta}_{OLS} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{y}\)). In addition, the model matrix for specification (2) takes up about 22 GB of memory, which might further slow down the computer due to a lack of physical memory or even crash the R session altogether.
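
To see where this figure comes from, a quick back-of-the-envelope calculation helps (a sketch assuming each numeric entry of the model matrix is stored as an 8-byte double, ignoring R’s object overhead):

# approximate size of the specification (2) model matrix in memory
n_rows <- 425653
n_cols <- 6929
# 8 bytes per numeric entry; divide by 1024^3 to convert bytes to gigabytes
(n_rows * n_cols * 8) / 1024^3
# approx. 22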

In order to set a point of reference, we first estimate specification (1) with standard OLS.

# fit specification (1)
runtime <- system.time(fit1 <- lm(data = cm, formula = model1))
coeftest(fit1)[2:4,]
##                     Estimate Std. Error t value
## pctsumyessamestate   0.11861   0.001085 109.275
## pctsumyessameparty   0.92640   0.001397 662.910
## factor(congress)101 -0.01458   0.006429  -2.269
##                     Pr(>|t|)
## pctsumyessamestate    0.0000
## pctsumyessameparty    0.0000
## factor(congress)101   0.0233
# time elapsed for the estimation (in seconds)
runtime[3]
## elapsed 
##   6.486

As expected, this takes quite some time to compute. However, there is an alternative approach to estimating such models that substantially reduces the computational burden by “sweeping out the fixed-effects dummies”. In the simple case of only one fixed-effects variable (e.g., only individual fixed effects), the trick is called the “within transformation” or “demeaning” and is quite simple to implement: for each category of the fixed-effects variable, compute the category mean of each variable (the dependent variable as well as the covariates) and subtract it from the corresponding values.

# illustration of within transformation for the senator fixed effects
cm_within <- 
  with(cm, data.table(yes = yes - ave(yes, id),
                      pctsumyessameschool = pctsumyessameschool -
                        ave(pctsumyessameschool, id),
                      pctsumyessamestate = pctsumyessamestate -
                        ave(pctsumyessamestate, id),
                      pctsumyessameparty = pctsumyessameparty -
                        ave(pctsumyessameparty, id)
                      ))

# comparison of dummy fixed effects estimator and within estimator
dummy_time <- system.time(fit_dummy <- 
              lm(yes ~ pctsumyessameschool + 
                       pctsumyessamestate +
                       pctsumyessameparty + 
                       factor(id) -1, data = cm
                         ))
within_time <- system.time(fit_within <- 
                             lm(yes ~ pctsumyessameschool +
                                      pctsumyessamestate + 
                                      pctsumyessameparty -1, 
                                      data = cm_within))
# computation time comparison
as.numeric(within_time[3])/as.numeric(dummy_time[3])
## [1] 0.009609
# comparison of estimates
coeftest(fit_dummy)[1:3,]
##                     Estimate Std. Error t value
## pctsumyessameschool  0.04424   0.001352   32.73
## pctsumyessamestate   0.11864   0.001085  109.30
## pctsumyessameparty   0.92615   0.001397  662.93
##                       Pr(>|t|)
## pctsumyessameschool 1.205e-234
## pctsumyessamestate   0.000e+00
## pctsumyessameparty   0.000e+00
coeftest(fit_within)
## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value
## pctsumyessameschool  0.04424    0.00135    32.7
## pctsumyessamestate   0.11864    0.00109   109.3
## pctsumyessameparty   0.92615    0.00140   663.0
##                     Pr(>|t|)    
## pctsumyessameschool   <2e-16 ***
## pctsumyessamestate    <2e-16 ***
## pctsumyessameparty    <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unfortunately, we cannot simply apply the same procedure in a specification with several fixed-effects variables. However, Gaure (2013b) provides a generalization of the linear within-estimator to several fixed-effects variables. This method is implemented in the lfe package (Gaure 2013a). With this package, we can easily estimate both fixed-effects specifications (as well as the corresponding cluster-robust standard errors) in order to replicate the original results by Cohen and Malloy (2014).

library(lfe)

# model and clustered SE specifications
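# felm() formulas consist of four parts separated by '|':
# covariates | fixed effects | IV specification (0 = none) | cluster variable(s) for SEs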
model1 <- yes ~ pctsumyessameschool + 
                pctsumyessamestate + 
                pctsumyessameparty |congress+id|0|id
model2 <- yes ~ pctsumyessameschool + 
                pctsumyessamestate + 
                pctsumyessameparty |congress_session_votenumber+id|0|id

# estimation
fit1 <- felm(model1, data=cm)
fit2 <- felm(model2, data=cm)

Finally, we can display the regression table.

stargazer::stargazer(fit1,fit2,
                     type="text",
                     dep.var.labels = "Vote (yes/no)",
                     covariate.labels = c("School Connected Votes",
                                          "State Votes",
                                          "Party Votes"),
                     keep.stat = c("adj.rsq", "n"))
## 
## ===================================================
##                            Dependent variable:     
##                        ----------------------------
##                               Vote (yes/no)        
##                             (1)            (2)     
## ---------------------------------------------------
## School Connected Votes    0.045***      0.052***   
##                           (0.016)        (0.016)   
##                                                    
## State Votes               0.119***      0.122***   
##                           (0.013)        (0.012)   
##                                                    
## Party Votes               0.926***      0.945***   
##                           (0.022)        (0.024)   
##                                                    
## ---------------------------------------------------
## Observations              425,653        425,653   
## Adjusted R2                0.641          0.641    
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01

12.2 Case study: Loops, memory, and vectorization

We first read the economics dataset (a small US macroeconomic time series dataset, originally shipped with the ggplot2 package) into R and extend it by duplicating its rows to get a slightly larger dataset (this step can easily be adapted to create a very large dataset).

# read dataset into R
economics <- read.csv("data/economics.csv")
# have a look at the data
head(economics, 2)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
# create a 'large' dataset out of this
for (i in 1:3) {
     economics <- rbind(economics, economics)
}
dim(economics)
## [1] 4592    6

The goal of this code example is to compute real personal consumption expenditures, assuming that pce in the economics dataset provides nominal personal consumption expenditures. Thus, we divide each value in the vector pce by a deflator of 1.05.

12.2.1 Naïve approach (ignorant of R)

The first approach we take is based on a simple for loop. In each iteration one element in pce is divided by the deflator, and the resulting value is stored as a new element in the vector pce_real.

# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
for (i in 1:n_obs) {
  pce_real <- c(pce_real, economics$pce[i]/deflator)
}

# look at the result
head(pce_real, 2)
## [1] 483.2 486.2

How long does it take?

# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real <- c(pce_real, economics$pce[i]/deflator)
})

time_elapsed
##    user  system elapsed 
##   0.108   0.000   0.110

Assuming a linear-time algorithm (\(O(n)\)), we need roughly that much time for each additional row of data (in fact, growing a vector inside a loop scales worse than linearly, so this extrapolation is, if anything, optimistic):

time_per_row <- time_elapsed[3]/n_obs
time_per_row
##   elapsed 
## 2.395e-05

If we are dealing with Big Data, say 100 million rows, that is

# in seconds
(time_per_row*100^4) 
## elapsed 
##    2395
# in minutes
(time_per_row*100^4)/60 
## elapsed 
##   39.92
# in hours
(time_per_row*100^4)/60^2 
## elapsed 
##  0.6654

Can we improve this?

12.2.2 Improvement 1: Pre-allocation of memory

In the naïve approach taken above, each iteration of the loop causes R to re-allocate memory because the number of elements in the vector pce_real changes. In simple terms, this means that R needs to execute additional steps in each iteration. We can avoid this with a simple trick: initialize the vector to its final length to begin with (filled with NA values).

# Improve memory allocation (still somewhat ignorant of R)
deflator <- 1.05 # define deflator
n_obs <- length(economics$pce)
# allocate memory beforehand
# Initialize the vector to the right size
pce_real <- rep(NA, n_obs)
# iterate through each observation
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real[i] <- economics$pce[i]/deflator
})

Let’s see if this helped to make the code faster.

time_per_row <- time_elapsed[3]/n_obs
time_per_row
##   elapsed 
## 2.178e-06

Again, we can extrapolate (approximately) the computation time, assuming a dataset with 100 million rows.

# in seconds
(time_per_row*100^4) 
## elapsed 
##   217.8
# in minutes
(time_per_row*100^4)/60 
## elapsed 
##    3.63
# in hours
(time_per_row*100^4)/60^2 
## elapsed 
## 0.06049

This looks much better, but we can do even better.

12.2.3 Improvement 2: Exploit vectorization

In this approach, we exploit the fact that in R, ‘everything is a vector’ and that many of the basic R functions (such as math operators) are vectorized. In simple terms, this means that a vectorized operation is dispatched only once for the whole vector: R figures out once how to apply the given function, and the element-by-element work is then carried out by fast compiled code. In a simple loop, by contrast, R has to go through the same ‘preparatory’ steps again and again in each iteration; this is time-intensive.

In this example, we specifically exploit that the division operator / is actually a vectorized function. Thus, the division by our deflator is applied to each element of economics$pce.

# Do it 'the R way'
deflator <- 1.05 # define deflator
# Exploit R's vectorization
time_elapsed <- 
     system.time(
     pce_real <- economics$pce/deflator
          )
# same result
head(pce_real, 2)
## [1] 483.2 486.2

Now this is much faster. In fact, system.time() is not precise enough to capture the elapsed time. In order to measure the improvement, we use microbenchmark::microbenchmark() to measure the elapsed time in microseconds (millionths of a second).

library(microbenchmark)
# measure elapsed time in microseconds (avg.)
time_elapsed <- 
  summary(microbenchmark(pce_real <- economics$pce/deflator))$mean
# per row (in sec)
time_per_row <- (time_elapsed/n_obs)/10^6

Now we get a more precise picture regarding the improvement due to vectorization:

# in seconds
(time_per_row*100^4) 
## [1] 0.1868
# in minutes
(time_per_row*100^4)/60 
## [1] 0.003113
# in hours
(time_per_row*100^4)/60^2 
## [1] 5.189e-05

12.3 Case study: Bootstrapping and parallel processing

In this example, we estimate a simple regression model that aims to assess racial discrimination in the context of police stops.¹ The example is based on the ‘Minneapolis Police Department 2017 Stop Dataset’, containing data on nearly all stops made by the Minneapolis Police Department for the year 2017.

We start by importing the data into R.

url <-
"https://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv"
stopdata <- data.table::fread(url) 

We specify a simple linear probability model that aims to test whether a person identified as ‘white’ is less likely to have their vehicle searched when stopped by the police. In order to take into account level differences between different police precincts, we add precinct indicators to the regression specification.

First, let’s remove observations with missing entries (NA) and code our main explanatory variable and the dependent variable.

# remove incomplete obs
stopdata <- na.omit(stopdata)
# code dependent var
stopdata$vsearch <- 0
stopdata$vsearch[stopdata$vehicleSearch=="YES"] <- 1
# code explanatory var
stopdata$white <- 0
stopdata$white[stopdata$race=="White"] <- 1

We specify our baseline model as follows.

model <- vsearch ~ white + factor(policePrecinct)

Then we estimate the linear probability model via OLS (using the lm() function).

fit <- lm(model, stopdata)
summary(fit)
## 
## Call:
## lm(formula = model, data = stopdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1394 -0.0633 -0.0547 -0.0423  0.9773 
## 
## Coefficients:
##                         Estimate Std. Error t value
## (Intercept)              0.05473    0.00515   10.62
## white                   -0.01955    0.00446   -4.38
## factor(policePrecinct)2  0.00856    0.00676    1.27
## factor(policePrecinct)3  0.00341    0.00648    0.53
## factor(policePrecinct)4  0.08464    0.00623   13.58
## factor(policePrecinct)5 -0.01246    0.00637   -1.96
##                         Pr(>|t|)    
## (Intercept)              < 2e-16 ***
## white                    1.2e-05 ***
## factor(policePrecinct)2     0.21    
## factor(policePrecinct)3     0.60    
## factor(policePrecinct)4  < 2e-16 ***
## factor(policePrecinct)5     0.05 .  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.254 on 19078 degrees of freedom
## Multiple R-squared:  0.025,  Adjusted R-squared:  0.0248 
## F-statistic: 97.9 on 5 and 19078 DF,  p-value: <2e-16

A potential problem with this approach (and there might be many more in this simple example) is that observations stemming from the same police precinct might be correlated over time. If that is the case, we likely underestimate the coefficients' standard errors. There is a standard approach to computing so-called cluster-robust standard errors, which takes within-cluster correlation over time into consideration (and delivers a more conservative estimate of the SEs). However, this approach only works well if the number of clusters in the data is roughly 50 or more. Here we only have five.
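
For reference, such conventional cluster-robust standard errors could be computed, for example, via the sandwich and lmtest packages (a minimal sketch; with only five clusters, these estimates should be interpreted with caution):

# for reference: conventional cluster-robust SEs
# (unreliable here, as we only have 5 clusters)
library(sandwich)
library(lmtest)
coeftest(fit, vcov. = vcovCL, cluster = ~ policePrecinct)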

The alternative approach is to compute bootstrapped clustered standard errors. That is, we apply the bootstrap resampling procedure at the cluster level: we draw \(B\) samples (with replacement), estimate and record the coefficient vector for each bootstrap sample, and then estimate \(SE_{boot}\) as the standard deviation of the recorded coefficient estimates.

# load packages
library(data.table)
# set the 'seed' for random numbers (makes the example reproducible)
set.seed(2)

# set number of bootstrap iterations
B <- 10
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)
# draw bootstrap samples, estimate model for each sample
for (i in 1:B) {
     
     # draw sample of precincts (cluster level)
     precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
     # get observations
     bs_i <- 
          lapply(precincts_i, function(x){
               stopdata[stopdata$policePrecinct==x,]
     } )
     bs_i <- rbindlist(bs_i)
     
     # estimate model and record coefficients
     boot_coefs[i,] <- coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
}

Finally, let’s compute \(SE_{boot}\).

se_boot <- apply(boot_coefs, 
                 MARGIN = 2,
                 FUN = sd)
se_boot
## [1] 0.004043 0.004690

Note that even with a very small \(B\), computing \(SE_{boot}\) takes some time. When setting \(B\) to over 500, computation time will be substantial. Also note that running this code hardly uses more memory than the very simple approach without bootstrapping (after all, in each bootstrap iteration the dataset used to estimate the model is approximately the same size as the original dataset). There is thus little we can do to improve the script's memory footprint. However, we can tell R how to allocate CPU resources more efficiently to handle that many regression estimates.

In particular, we can make use of the fact that most modern computing environments (such as a laptop) have CPUs with several cores. We can exploit this fact by instructing the computer to run the computations in parallel (simultaneously computing on several cores). The following code is a parallel implementation of our bootstrap procedure that does exactly that.

# load packages for parallel processing
library(doSNOW)
# get the number of cores available
ncores <- parallel::detectCores()
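# (note: detectCores() counts logical cores; on a shared machine you may
#  want to use ncores - 1 to keep one core free for other processes)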
# set cores for parallel processing
ctemp <- makeCluster(ncores)
registerDoSNOW(ctemp)


# set number of bootstrap iterations
B <- 10
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)

# bootstrapping in parallel
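# (note: .combine=rbind stacks the coefficient vectors returned by the workers
#  into a matrix; .packages loads data.table on each worker)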
boot_coefs <- 
     foreach(i = 1:B, .combine = rbind, .packages="data.table") %dopar% {
          # draw sample of precincts (cluster level)
          precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
          # get observations
          bs_i <- lapply(precincts_i, function(x) {
            stopdata[stopdata$policePrecinct==x,]
          })
          bs_i <- rbindlist(bs_i)
          # estimate model and record coefficients
          coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
     }
# be a good citizen and stop the snow clusters
stopCluster(cl = ctemp)

As a last step, we again compute \(SE_{boot}\).

se_boot <- apply(boot_coefs, 
                 MARGIN = 2,
                 FUN = sd)
se_boot
## (Intercept)       white 
##    0.002446    0.003017

12.3.1 Parallelization with an EC2 instance

This short tutorial illustrates how to scale up the computation of clustered standard errors shown above by running it on an AWS EC2 instance. There are a few things to keep in mind to make the script run on an AWS EC2 instance in RStudio Server. First, our EC2 instance is a Linux machine, and on Linux most R packages need to be compiled from source before they can be installed. The command to install packages is exactly the same (install.packages()); you will normally only notice a slight difference in the output shown in the R console during installation, and the installation process takes a little longer than you are used to. Apart from that, using R via RStudio Server in the cloud looks and feels very similar, if not identical, to using R/RStudio locally. For this step of the case study, first follow the instructions in Chapter 7 on how to set up an AWS EC2 instance with R/RStudio Server. Then, open a browser window, log in to RStudio Server on the EC2 instance, and copy and paste the code below into a new R file on the EC2 instance (note that you might have to install the data.table and doSNOW packages before running the code).

When executing the code below line by line, you will notice that essentially all parts of the script work exactly as on your local machine. This is one of the great advantages of running R/RStudio Server in the cloud: you can implement and test your entire data analysis locally (based on a small sample) and then run it at a larger scale in the cloud in exactly the same way (even with the same graphical user interface, GUI).

# install packages
install.packages("data.table")
install.packages("doSNOW")
# load packages
library(data.table)

# fetch the data
url <- "https://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv"
stopdata <- read.csv(url)
# remove incomplete obs
stopdata <- na.omit(stopdata)
# code dependent var
stopdata$vsearch <- 0
stopdata$vsearch[stopdata$vehicleSearch == "YES"] <- 1
# code explanatory var
stopdata$white <- 0
stopdata$white[stopdata$race == "White"] <- 1

# model fit
model <- vsearch ~ white + factor(policePrecinct)
fit <- lm(model, stopdata)
summary(fit)
# bootstrapping: non-parallel approach
# set the 'seed' for random numbers (makes the example reproducible)
set.seed(2)
# set number of bootstrap iterations
B <- 50
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)
# draw bootstrap samples, estimate model for each sample
for (i in 1:B) {
  # draw sample of precincts (cluster level)
  precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
  # get observations
  bs_i <- lapply(precincts_i, function(x) {
    stopdata[stopdata$policePrecinct == x, ]
  })
  bs_i <- rbindlist(bs_i)
  # estimate model and record coefficients
  boot_coefs[i, ] <- coef(lm(model, bs_i))[1:2]  # ignore FE-coefficients
}

se_boot <- apply(boot_coefs, MARGIN = 2, FUN = sd)
se_boot

So far, we have only demonstrated that the simple (non-parallel) implementation works both locally and in the cloud. However, the real purpose of using an EC2 instance in this example is that we can scale up the instance to have more CPU cores available for the parallel implementation of our bootstrap procedure. Recall that running the script below on our local machine employs all locally available cores and computes the bootstrap resampling in parallel on all of them. Exactly the same happens when running the code below on our simple t2.micro instance. However, this type of EC2 instance only has one core. You can verify this by running parallel::detectCores() in RStudio Server (the parallel package is part of base R, so no additional installation is needed).

When running the entire parallel implementation below, you will thus notice that it won't compute the bootstrap SE any faster than the non-parallel version above. However, by simply launching another EC2 instance type with more cores, we can distribute the workload across many CPU cores, using exactly the same R script.

# bootstrapping: parallel approach
# install.packages("doSNOW")  # if not yet installed ('parallel' ships with base R)
# load packages for parallel processing
library(doSNOW)
# set cores for parallel processing
ncores <- parallel::detectCores()
ctemp <- makeCluster(ncores) 
registerDoSNOW(ctemp)
# set number of bootstrap iterations
B <- 50
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)

# bootstrapping in parallel
boot_coefs <- 
  foreach(i = 1:B, .combine = rbind, .packages="data.table") %dopar% {
    # draw sample of precincts (cluster level)
    precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
    # get observations
    bs_i <- lapply(precincts_i, function(x){
         stopdata[stopdata$policePrecinct==x,]
    })
    bs_i <- rbindlist(bs_i)
    
    # estimate model and record coefficients
    coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
  }

# be a good citizen and stop the snow clusters
stopCluster(cl = ctemp)
# compute the bootstrapped standard errors
se_boot <- apply(boot_coefs, 
                 MARGIN = 2,
                 FUN = sd)

References

Cohen, Lauren, and Christopher J. Malloy. 2014. “Friends in High Places.” American Economic Journal: Economic Policy 6 (3): 63–91. https://doi.org/10.1257/pol.6.3.63.
Gaure, Simen. 2013a. “lfe: Linear Group Fixed Effects.” The R Journal 5 (2): 104–16. https://doi.org/10.32614/RJ-2013-031.
———. 2013b. “OLS with Multiple High Dimensional Category Variables.” Computational Statistics & Data Analysis 66: 8–18. https://doi.org/10.1016/j.csda.2013.03.024.
R Core Team. 2022. foreign: Read Data Stored by ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, ‘dBase’, ... https://CRAN.R-project.org/package=foreign.
Zeileis, Achim, and Torsten Hothorn. 2002. “Diagnostic Checking in Regression Relationships.” R News 2 (3): 7–10. https://CRAN.R-project.org/doc/Rnews/.

  1. Note that this example aims to illustrate a point about computation in an applied econometrics context. It does not make any argument whatsoever about identification or the broader research question.