
I'm trying to fit a linear model with roughly 900,000 observations and just two explanatory variables. Yet, I additionally need to include a control variable that is a many-level factor variable (11,135 levels). The code for the regression looks like this:

model1 <- lm(dep_var ~ expl_var_1 + expl_var_2 + factor(control_var), data = data)

However, R throws the error "Cannot allocate a vector of size 75.6 GB". I'm well aware that this is due to the many-level factor variable; however, I need to include this variable as a control. Please note: this is not an ordered factor; it is simply an ID without any order.

I've tried to find a solution to this problem, but ran into problems:

  • I looked into plm - but that doesn't work: while my control variable can be interpreted as an ID, time doesn't play a role (and even if it did, there can be more than one observation per ID per time period).
  • I looked into biglm - but that package addresses the case of many observations rather than a many-level factor.

My questions:

  1. Is there a way to include a variable in the regression but leave it out when assigning the outcome of the regression to model1? I'm not interested in the coefficients of the individual control-variable factor levels at all; I just need to control for it.
  2. If there isn't: can I split up my regression efficiently even if I cannot ensure that every control-variable factor level is present in each chunk (which isn't feasible, because some levels have just one observation)?

I'd appreciate any starting points and ideas on where to look for a solution - currently I'm stuck at the limits of my knowledge and understanding.

Thanks in advance for your time, support, and patience.

Odysseus
  • I think a mixed model may suit your problem better (a sketch follows this comment thread) – Stefano Barbi Mar 17 '22 at 11:50
  • Hi Stefano, thanks for this pointer. I looked into it and wanted to ask you a follow-up question. In my case, I would run a linear mixed model in which the random-effects part includes the factor variable? Thanks in advance – Odysseus Mar 17 '22 at 12:00
  • Yes, that is what the description of your problem suggested to me. – Stefano Barbi Mar 17 '22 at 12:03
  • Thanks Stefano, that made my day. – Odysseus Mar 17 '22 at 12:04
  • You will want to look into developing a sparse matrix for the model. To regress on your data, the factor will be converted into dummy columns, and a 900k x 11k dense matrix takes up a lot of space. [This vignette](https://cran.r-project.org/web/packages/Matrix/vignettes/sparseModels.pdf) is a decent intro (see the sparse sketch after this comment thread). – Gregor Thomas Mar 17 '22 at 12:49
  • Thanks, Gregor. This is also a very helpful pointer. I'll definitely look into it. – Odysseus Mar 17 '22 at 13:26
  • https://stackoverflow.com/questions/3169371/large-scale-regression-in-r-with-a-sparse-feature-matrix – Ben Bolker Mar 17 '22 at 14:17
  • I checked out the avenue you suggested, Gregor, but I seem to hit a dead end after creating the sparse matrix: passing it to a linear regression seems to be under development - or I'm just not finding the right info. Could you help me out? – Odysseus Mar 17 '22 at 14:18
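
A minimal sketch of the mixed-model route suggested in the comment thread, assuming the Gaussian case and the variable names from the question: with lme4, the control enters as a random intercept, so 11,135 fixed dummy coefficients are never estimated.

library(lme4)

# control_var as a random intercept instead of 11,135 dummy columns
model_mm <- lmer(dep_var ~ expl_var_1 + expl_var_2 + (1 | control_var), data = data)
summary(model_mm)

Note that the random intercepts are shrunken toward zero, so the result is not numerically identical to a fixed-effects control.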
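
And a sketch of the sparse-matrix approach, again under the question's variable names: sparse.model.matrix() from the Matrix package builds the design matrix without densifying the dummy columns, and the normal equations can then be solved with a sparse Cholesky factorization (lm() itself does not accept sparse matrices).

library(Matrix)

# Sparse design matrix: the 11,135-level factor becomes sparse dummy columns
X <- sparse.model.matrix(~ expl_var_1 + expl_var_2 + factor(control_var),
                         data = data)
y <- data$dep_var

# Solve X'X b = X'y via sparse Cholesky; a sketch only - lm()'s QR-based
# approach is more numerically stable
beta <- solve(crossprod(X), crossprod(X, y))

head(beta) then shows the intercept and the two explanatory coefficients.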

2 Answers


I am late to the party, but I actually don't see why biglm would not work. You would not need to expand the control into dummy columns yourself; keep it as one factor, which makes the problem much less sparse. The only thing to do is create chunks of the data ahead of the biglm() call (which you can do with split(), or sample() followed by split()), run biglm() on the first chunk, and then feed the remaining chunks through update(). The number of chunks will depend on your memory.

The only thing to watch is that the factor levels are defined identically in each chunk (set the levels, with or without relevel(), before chunking). For levels absent from a chunk, biglm will return NA coefficients, which are filled in as the later chunks are processed.
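
A minimal sketch of this chunked fit, assuming the variable names from the question and an arbitrary chunk count of 10:

library(biglm)

# Fix the full level set once, BEFORE chunking, so every chunk's model
# matrix has identical columns
data$control_var <- factor(data$control_var)
chunks <- split(data, rep(seq_len(10), length.out = nrow(data)))

f <- dep_var ~ expl_var_1 + expl_var_2 + control_var
fit <- biglm(f, data = chunks[[1]])           # fit on the first chunk
for (i in 2:length(chunks)) {
  fit <- update(fit, moredata = chunks[[i]])  # update() has a biglm method
}
coef(fit)[1:3]  # the non-factor coefficients come first in formula order

Memory then scales with the chunk size plus the incremental QR decomposition biglm maintains, not with the full 900k-row model matrix.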

Geofc

The lfe package (Linear Fixed Effects) provides the felm() function. It works similarly to what Stefano Barbi suggested, but treats the factor variable as fixed effects rather than random effects, which is much closer to what I initially wanted. Also, felm() objects are compatible with sandwich, so I can cluster my standard errors - though not on the absorbed factor variable (which in my case is not a problem):

install.packages("lfe")
library(lfe)

# The control factor after "|" is absorbed (projected out), so its
# 11,135 coefficients are never materialized
model1 <- felm(dep_var ~ expl_var_1 + expl_var_2 | factor(control_var), data = data)

The felm() function is incredibly fast compared to lm() - at least in my specific case. lfe comes with a number of vignettes for specific problems. From the help file: "The package uses the Method of Alternating Projections to estimate linear models with multiple group fixed effects. A generalization of the within estimator. It supports IV-estimation with multiple endogenous variables via 2SLS, with conditional F statistics for detection of weak instruments. It is thread-parallelized and intended for large problems. A method for correcting limited mobility bias is also included."

The only drawback I have found so far: felm() objects are not compatible with margins, so you have to compute margins by hand.
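
For reference, a sketch of clustering without going through sandwich: felm() also supports cluster-robust standard errors natively via the fourth part of its formula (cluster_var here is a hypothetical clustering ID, not from the original question):

library(lfe)

# formula parts: covariates | absorbed fixed effects | IV (none) | cluster variable
model2 <- felm(dep_var ~ expl_var_1 + expl_var_2 | factor(control_var) | 0 | cluster_var,
               data = data)
summary(model2)  # reports cluster-robust standard errors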

L Tyrone
Odysseus