Conley standard errors and high dimensional fixed effects

Patrick Baylis 2015-10-12 2 minute read

Update (April 2018): This post is deprecated; these days I estimate spatial econometric models in R, using clustered standard errors. There are Conley implementations in R, but as far as I can tell the technique has become less popular over time, and all of the implementations I know of suffer from performance issues.

For my JMP, I cluster my standard errors two ways, across both geography and time. During a recent seminar, one of the audience members asked me why I wasn’t instead using spatial standard errors, such as those described in Conley (2008).

A case where this might matter is as follows: suppose I’m worried about correlated variation in both my left- and right-hand side variables between observations that are near each other (putting aside correlation across time for now, since the concept is equivalent). One typical solution, equivalent to what I was using, is to cluster at some geographic level, say by county. If the correlations only occur within each county, then this is sufficient. If, however, observations across county lines are correlated (e.g., Kansas City), then the standard errors I estimate may be too small. Conley standard errors solve this problem. In fact, one of my advisers, Sol Hsiang, has implemented these errors in both Matlab and Stata. I also found a version for R, though I haven’t tested it.
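As a minimal sketch of that typical solution in Stata (with hypothetical variable names y, x, and county, not anything from my data):

* Cluster standard errors at the county level (hypothetical variables)
reg y x, vce(cluster county)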

However, I have a lot of data and multiple dimensions of fixed effects, so I am using Sergio Correia’s fantastic reghdfe Stata package. He is planning to implement Conley standard errors, but hasn’t gotten around to it yet. Thiemo Fetzer has implemented a solution using Sol’s code, but it relies on reg2hdfe (which is similar to, but generally slower than, reghdfe) and looks complicated.
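For context, the two-way clustered specification I described above looks roughly like this with reghdfe (a sketch reusing the variable names from the code below, not necessarily my exact specification):

* Absorb grid cell and state-by-month fixed effects; two-way cluster
* by grid cell and date (illustrative only)
reghdfe afinn_score_std tmean_b*, absorb(gridNum statemonth) ///
    vce(cluster gridNum date)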

Instead, I use hdfe, which does the demeaning for reghdfe, to partial out my fixed effects. Then I run Sol’s code on the demeaned data. I’ve posted the code without context below, to give an example:

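* Partial out the fixed effects (grid cell and state-by-month) with hdfe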
hdfe afinn_score_std tmean_b*, clear absorb(gridNum statemonth) ///
    tol(0.001) keepvars(gridNum date lat lng)

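* Run Sol's spatial HAC routine on the demeaned data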
ols_spatial_HAC afinn_score_std tmean_b*, lat(lat) lon(lng) ///
    time(date) panel(gridNum) distcutoff(16) lagcutoff(7) disp

This appears to be much too slow for my dataset (>2 million obs at different locations, more than a year of daily data). I found an updated version in R, again from Thiemo Fetzer, but this too isn’t fast enough for my needs, even though the matrix operations are implemented in C++. I may try to write my own (in Julia?) at some point, but for now the best solution will be to estimate them for a subset of my data.
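As a rough sketch of that last option (assuming the data have already been demeaned with hdfe as above; the 10% subsample and the seed are arbitrary choices for illustration):

* Estimate Conley standard errors on a random subsample (arbitrary fraction)
preserve
set seed 12345
sample 10
ols_spatial_HAC afinn_score_std tmean_b*, lat(lat) lon(lng) ///
    time(date) panel(gridNum) distcutoff(16) lagcutoff(7) disp
restore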