R and Julia, again

2017/01/19

Update (April 2018): Julia has change quite a bit since the time of this writing. FixedEffectsModels.jl still exists, but I still frequently run into strange errors when estimating. As of now, I still resort to sampling and running large regressions in lfe::felm in R.

I’m trying to use Julia, and specifically FixedEffectsModels.jl, to run fixed effects regressions more quickly. There are complications:

First, I create some fake data in R and export it as both as CSV and as feather.

# create some fake data
n <- 10000000
dt <- data.table(x=rnorm(n),
                 t=rnorm(n, 30, 5),
                 g=floor(runif(n, 0, 10)))
dt[, y:=2 - 3*x + sin(2*t) + cos(3*g) + rnorm(n, 0, 0.1)]
dt[, t_cut:=cut(t, c(-Inf, 20, 30, 40, Inf))]
dt[, t_cut:=relevel(t_cut, 2)] # reorder to omit 20-30
dt[, t_cut_n:=as.numeric(t_cut)] # convert to numeric for export to Julia

write_feather(dt[ ,list(y, x, t_cut_n, g)], "data.feather"))

dt[, `:=`(g=factor(g))] # convert g to factor for felm
felm.out <- felm(y ~ x + t_cut | g | 0 | g, data=dt)

Important: right now, FixedEffectModels.jl uses DataArrays.jl, a deprecated package that deals poorly with factor variables and with NA values. Until it incorporates CategoricalArrays.jl (which handles both nicely), I have to be careful with what I pass from R. In particular, I can’t pass data.frames with NA values or with factor variables. Instead, I have to pass a substitute categorical variable that is actually just an integer. This can be done in R fairly easily, as long as I run relevel first. Then, I can run this:

using DataFrames, FixedEffectModels, Feather

# Import data via feather
@time feather_data = Feather.read("data.feather", nullable=false)

# Pool data
@time feather_data[:t_cut_n] = pool(feather_data[:t_cut_n])
@time feather_data[:gPooled] = pool(feather_data[:g])

# Regress
@time reg(y ~ x + t_cut_n |> gPooled, feather_data, VcovCluster([:gPooled]))

OLD VERSION OF THIS post

I had heard about Julia at various points, usually by people in the data science/Kaggle sphere, but didn’t know much about it. However, for my JMP I need to run a few very big regressions, on the order of a billion observations with multiple dimensions of many-leveled fixed effects. reghdfe(https://github.com/sergiocorreia/reghdfe) and lfe(https://cran.r-project.org/web/packages/lfe/index.html) are the weapons of choice in Stata and R, respectively, for these kinds of models but someone while poking around on the internet I ran across FixedEffectModels.jl. Click that link and compare the speeds. The graph is killer, and it’s not some rigged example - real world experience bears it out.

Suddenly learning Julia became a lot more interesting. I figured, heck, I know Python and R - how hard could it be? It’s actually pretty hard. Julia is meant to be easy to code in (like Python) but fast as hell (like C). And it IS easy to code in, sort of. But there are some gotchas to learning a relatively recent language:

A big part of the issue is that I’m facing is not in getting FixedEffectModels.jl to work, it’s just getting my huge dataset formatted in a way that I can run the darn thing. In principle I could use DataRead.jl, but I haven’t figured out how to compile ReadStat on a PC (the server I’m using). So instead I’m limited to either importing via cleaned CSV or SQLite database. SQLite is much faster but I have to redo all of the cleaning, while CSV is painfully slow and runs out of memory because Julia gets the types wrong (see: “Naturalness”). So now I’m learning how to import the data through SQLite in chunks, type it properly, and merge in the datasets I use in my Stata cleaning process. All before I get to run a single regression. And three weeks before my JMP needs to be done.

But Julia is a nice tool to have, and it seems to have some momentum. The community appears to be tremendous.

More notes (I feel like I’m learning a ton about language design):