did2s (Gardner 2021)

Table of contents

  1. Introduction
  2. Installation and options
  3. Dataset
  4. Test the package

Introduction

The did2s R package by Kyle Butts implements the method proposed by the Gardner 2021 paper Two-stage differences in differences. A detailed description is provided on the did2s website.

The key idea behind did2s is pretty simple and is clever implementation of the Frisch-Waugh-Lovell (FWL) theorem that should be familiar to many readers. In short, we can avoid the pathologies of staggered DiD settings by running two regressions (hence the name):

  1. Run a fixed-effects regression (including controls) on our outcome variable using only the untreated/not-yet-treated observations. Residualize the outcomes (i.e., subtract the predicted outcomes from the actual outcomes).
  2. Regress the residualized outcomes on the treatment dummies (or, time-to-treament dummies for an event study) using all obsevations to get the unbiased treatment effects.

A virtue of this two-step procedure is that it is very quick to estimate. Underneath the hood, did2s calls fixest, so all of the associated methods of the latter package are available (tabling, plotting, etc.). More importantly, it shares some syntax shortcuts/conventions that we should use for specifying our models. We’ll see some examples of this below.

Before continuing, it is worth noting that the did2s package also provides convenience functions for running and visualizing a range of DiD estimators (i.e., not just the method proposed by Gardner 2021). This makes it a very useful package to have in the applied econometrican’s R toolkit. We’ll save this functionality for the “All estimators” section, though. STILL NEED TO ADD THIS.

Installation and options

The package can be installed from CRAN.

install.packages("did2s") # Install (only need to run once or when updating)
library("did2s")          # Load the package into memory (required each new session)

The core estimating function is did2s() and it takes the following arguments:

did2s(data, yname, treatment, first_stage, second_stage, ...)

where

Variable Description
data dataset
yname outcome variable (character)
first_stage 1st-stage regression forumula (controls & FEs)
second_stage 2nd-stage regression formula (treatment indicator)
treatment treatment dummy variable (character)
cluster_var how to cluster SEs (character)
Additional arguments (bootstrapping, etc.)

As mentioned above, did2s calls fixest underneath the hood and so expects some the syntax conventions and shortcuts offered by the latter. The most obvious cases are the use of the | fixed-effect slot in first_stage formula, and the use of i() in the second_stage formula. We’ll illustrate this directly with an example below. But you can take a look at the helpfile (?did2s) for detailed information and additional examples.

Dataset

To demonstrate the package in action, we’ll use the fake dataset that we created earlier. Here’s a reminder of what the data look like.

head(dat)
#>   time id        y rel_time treat first_treat
#> 1    1  1 2.158289      -11 FALSE          12
#> 2    2  1 2.498052      -10 FALSE          12
#> 3    3  1 3.034077       -9 FALSE          12
#> 4    4  1 4.886266       -8 FALSE          12
#> 5    5  1 7.085950       -7 FALSE          12
#> 6    6  1 5.788352       -6 FALSE          12

Or, in graph form.

Test the package

Remember to load the package (if you haven’t already).

library(did2s)

Let’s try the basic did2s() command on our fake dataset. We’ll start by estimating a simple binary treatment effect, rather than a full event-study. There are some things that I want to draw your attention to w.r.t. our first- and second-stage regression arguments below. The first_stage formula uses | to demarcate regular covariates (although theren’t any in this dataset) from the fixed-effects slot. The second_stage formula invokes the i() operator to create an indicator (factor) of our “treat” variable. Both of these syntax features should be familiar to fixest users, but are worth underscoring all the same.

did2s(
  data         = dat,
  yname        = "y", 
  first_stage  = ~ 0 | id + time, # 0 b/c we have no controls in this dataset
  second_stage = ~ i(treat),      # binary treatment dummy (not an event-study) 
  treatment    = "treat",
  cluster_var  = "id",
  )

#' Running Two-stage Difference-in-Differences
#' • first stage formula `~ 0 | id + time`
#' • second stage formula `~ i(treat)`
#' • The indicator variable that denotes when treatment is on is `treat`
#' • Standard errors will be clustered by `id`
#' OLS estimation, Dep. Var.: y
#' Observations: 1,800 
#' Standard-errors: Custom 
#'                   Estimate   Std. Error  t value  Pr(>|t|)    
#' treat::FALSE -5.010000e-15 1.070000e-15 -4.69674 2.844e-06 ***
#' treat::TRUE   1.399854e+02 1.216437e+01 11.50782 < 2.2e-16 ***
#' ---
#' Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#' RMSE: 80.9   Adj. R2: 0.425782

Okay, now let’s try a more realistic use-case by estimating a full event-study. The only change we need to make vis-à-vis the previous regression is in the second_stage formula. This time, we’ll use a relative time variable (AKA time-to-treatment) rather than a binary treatment indicator. Note that I’m specifying two reference periods via i(rel_time, ref = c(-1, Inf)). The first (-1) sets all effects relative to the period immediately before treatment. The second (Inf) establishes the “never-treated” control group for this dataset; this reference value may differ for your own data. I’ll also go ahead and save the resulting model object, since I plan to graph the corresponding event-study plot below.

es_mod = did2s(
  data         = dat,
  yname        = "y",
  first_stage  = ~ 0 | id + time,
  second_stage = ~ i(rel_time, ref = -c(1, Inf)), # Use relative time var. for event-study
  treatment    = "treat",
  cluster_var  = "id"
  )

#' Running Two-stage Difference-in-Differences
#' • first stage formula `~ 0 | id + time`
#' • second stage formula `~ i(rel_time, ref = -c(1, Inf))`
#' • The indicator variable that denotes when treatment is on is `treat`
#' • Standard errors will be clustered by `id`

es_mod
#' OLS estimation, Dep. Var.: y
#' Observations: 1,800 
#' Standard-errors: Custom 
#' rel_time::-43  -0.049719   0.272847   -0.182224 0.8554282    
#' rel_time::-42  -0.150764   0.196905   -0.765670 0.4439783    
#' rel_time::-41   0.411294   0.419485    0.980472 0.3269918    
#' rel_time::-40  -0.060809   0.285788   -0.212777 0.8315264    
#' rel_time::-39  -0.104809   0.445865   -0.235069 0.8141833    
#' <TRUNCATED>
#' ---
#' Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#' RMSE: 25.7   Adj. R2: 0.939163

As I keep emphasizing, did2s is using fixest underneath the hood and all of the latter methods apply. So we plot our event-study model using fixest::iplot().

# fixest::iplot(es_mod) 
# Vanilla option (above) is fine, but we can tweak a bit...
es_mod |>
  fixest::iplot(
    main     = "did2s",
    xlab     = "Time to treatment",
    drop     = "[[:digit:]]{2}",    # Drop any leads/lags greater than |9|
    ref.line = 1
  )

P.S. For those of you that would prefer a ggplot2 version of the above (base R) plot, check out ggfixest.