Memory issue when using factor for fixed effect regression

Question

I am working with a dataset of 900,000 observations. A categorical variable x with 966 unique values needs to be included as fixed effects. I am including the fixed effects using factor(x) in the regression, which gives the following error:

Error: cannot allocate vector of size 6.9 Gb

How can I fix this error? Or do I need to specify the fixed effects differently in the regression?

Then, how do I run a regression like this:

rlm(y~x+ factor(fe), data=pd)

966 columns times 9e5 is 869400000. Times 4 bytes per integer is 3.5 Gb. And R generally needs more memory for its internal operations. — Rui Barradas, Feb 23 '20 at 20:06
Apart from buying more memory? [Here](https://stackoverflow.com/a/47999406/8245406) are some tips. — Rui Barradas, Feb 23 '20 at 20:28

Ben Bolker · Answer 1 · 2020-02-24T00:42:03.780

The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
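The structure described above can be seen on a toy example. This is only an illustrative sketch: the factor `f` and its four levels are made up, and `~ 0 + f` is used so that every level gets its own indicator column (with the default intercept and treatment contrasts, the first level is dropped and its rows are all zeros instead).

```r
# Toy factor with 4 levels (hypothetical data, for illustration only)
f <- factor(c("a", "c", "b", "d", "a", "b"))

# "~ 0 + f" drops the intercept so every level gets its own indicator column
mm <- model.matrix(~ 0 + f)
mm

# Each row holds exactly one 1; every other entry is 0
ones_per_row <- rowSums(mm)

# Share of non-zero entries: 1 / (number of levels)
share_nonzero <- mean(mm != 0)
```

With 966 levels the share of non-zero entries drops to roughly 0.1%, which is why a sparse representation pays off.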

Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix + lm.fit rather than being able to use lm).

In contrast to @RuiBarradas's estimate of 3.5Gb for a dense model matrix:

m <- Matrix::sparse.model.matrix(~x,
     data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
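For comparison, the dense size can be worked out by hand. This is a back-of-the-envelope sketch that assumes the dense matrix is stored as doubles (8 bytes per entry, which is what model.matrix() returns) rather than 4-byte integers; the variable names are only for illustration.

```r
n_obs    <- 9e5   # rows, from the question
n_levels <- 966   # dummy columns, from the question (ignoring other predictors)

# Dense storage, assuming 8 bytes per double-precision entry
dense_bytes <- n_obs * n_levels * 8

# R reports allocation sizes in units of 1024^3 bytes
dense_gib <- dense_bytes / 1024^3
round(dense_gib, 2)
```

Under that assumption the dense matrix comes out at roughly twice the integer-based estimate, which is closer to the size in the error message.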

If you are using the rlm function from the MASS package, something like this should work:

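library(Matrix)
library(MASS)
## build the design matrix in sparse form; the response is passed separately
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
## matrix (default) method of rlm; add any further rlm arguments you need
rlm(x=mm, y=pd$y)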

Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
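To check that the sparse route works independently of rlm(), here is a minimal sketch that fits ordinary least squares directly on a sparse model matrix via the normal equations. Note that this is plain OLS, not the robust M-estimation that rlm() performs, and the simulated data frame (with 50 fixed-effect levels instead of 966) is a hypothetical stand-in for the real data.

```r
library(Matrix)

set.seed(1)
# Simulated stand-in for the asker's data (sizes and values are made up)
n  <- 5000
pd <- data.frame(x  = rnorm(n),
                 fe = factor(sample(1:50, size = n, replace = TRUE)))
pd$y <- 2 * pd$x + as.integer(pd$fe) / 10 + rnorm(n)

# Sparse design matrix: intercept, x, and the dummy columns for fe
X <- sparse.model.matrix(~ x + fe, data = pd)

# OLS via the normal equations (X'X) b = X'y, keeping X sparse throughout
XtX  <- crossprod(X)
Xty  <- crossprod(X, pd$y)
beta <- as.vector(as.matrix(solve(XtX, Xty)))
names(beta) <- colnames(X)

# Sanity check against lm() on the same (small) data
fit_lm   <- lm(y ~ x + fe, data = pd)
max_diff <- max(abs(beta - coef(fit_lm)))
```

On data this small lm() is perfectly adequate; the point is only that the sparse matrix never has to be expanded to dense form. Whether the same holds inside rlm() is a separate question.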

Thanks! But to run rlm(y~x+ factor(fe), data=pd), what should I do exactly? How do I write your above command for it? — D_B, Feb 23 '20 at 21:45
I've answered this. Note that this is the kind of information that's really useful to include in your question in the first place ... — Ben Bolker, Feb 24 '20 at 00:42
No! I got the same error, this time 5.9 Gb. I simplified my problem above: I have two sets of fixed effects, one with 966 unique values and the other with 170 unique values. Is that why I get the error? — D_B, Feb 25 '20 at 03:09
Are both factors? That is, is your model `y~ factor(x1)+factor(x2)`? As I said, it's possible that `MASS::rlm` does some internal computation at some point that coerces the model matrix to dense ... — Ben Bolker, Feb 25 '20 at 09:13
