I'm trying to fit a linear model with roughly 900,000 observations and just two explanatory variables. However, I also need to include a control variable that is a factor with many levels (11,135 levels). The code for the regression looks like this:
model1 <- lm(dep_var ~ expl_var_1 + expl_var_2 + factor(control_var), data=data)
However, R throws the error "Cannot allocate a vector of size 75.6 GB". I'm well aware that this is caused by the many-level factor variable, but I need to include this variable as a control. Please note: this is not an ordered factor; it is simply an ID without any order.
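As a rough sanity check on where that number comes from, here is a back-of-the-envelope sketch. It assumes that lm() builds a dense design matrix of 8-byte doubles with one dummy column per factor level (minus the reference level), and that R reports sizes in units of 2^30 bytes. The row count is only approximate, so the result is a ballpark figure rather than an exact match.

```r
# Ballpark size of a dense design matrix: one 8-byte double per cell.
# n_obs is approximate ("roughly 900,000"), so expect a rough figure only.
n_obs    <- 900000                  # approximate number of observations
n_levels <- 11135                   # levels of control_var
n_cols   <- (n_levels - 1) + 2 + 1  # dummy columns + two regressors + intercept
size_gib <- n_obs * n_cols * 8 / 2^30
size_gib
```

Under these assumptions the estimate lands in the same range as the size in the error message, so the dummy columns alone account for essentially all of the memory.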
I've tried to find a solution to this problem, but ran into problems:
- I looked into plm, but that doesn't work: although my control variable can be interpreted as an ID, time doesn't play a role (and even if it did, there can be more than one observation per ID per time period).
- I looked into biglm, but it seems better suited to data with many observations than to a factor with many levels.
My questions:
- Is there a way to include a variable in the regression but leave it out when assigning the result of the regression to model1? I'm not interested in the coefficients for the individual levels of the control variable. I just need to control for it.
- If there isn't: can I efficiently split up my regression even though I cannot ensure that every chunk contains all levels of the control variable (that isn't feasible, because some levels have just 1 observation)?
I'd appreciate any starting points or ideas on where to look for a solution. With my current level of knowledge and understanding, I'm stuck.
Thanks in advance for your time, support, and patience.