
I am working with a dataset of 900,000 observations. A categorical variable x with 966 unique values needs to be included as fixed effects. I am including the fixed effects using factor(x) in the regression, which gives me the following error:

Error: cannot allocate vector of size 6.9 Gb

How can I fix this error? Or do I need to handle the fixed effects differently in the regression?

Then, how do I run a regression like this:

rlm(y~x+ factor(fe), data=pd)
Ben Bolker
D_B

1 Answer


The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
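To make the sparsity concrete, here is a toy illustration. The 4-level factor is made up, and `~ 0 + x` is used to get full indicator coding (the default treatment contrasts would instead drop one level and add an intercept column):

```r
# Hypothetical toy factor: 4 levels, 8 observations
x <- factor(c("a", "b", "c", "d", "a", "c", "b", "d"))

# Full indicator (one-hot) coding: one column per level, no intercept
mm <- model.matrix(~ 0 + x)

rowSums(mm)    # every row sums to 1, i.e. a single 1 per row
mean(mm == 0)  # share of zero entries: (k - 1) / k for k levels, 0.75 here
```

With 966 levels the same ratio is 965/966, so roughly 99.9% of those columns' entries are zeros.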

Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix() (or MatrixModels::model.Matrix(*, sparse=TRUE) as suggested by the sparse.model.matrix documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix + lm.fit rather than being able to use lm).
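A minimal sketch of the "model matrix + response vector" route for ordinary least squares, on simulated data (the data frame `d` and its columns are hypothetical). It solves the normal equations with the Matrix package's sparse methods instead of calling lm.fit, since whether lm.fit keeps a sparse matrix sparse is not verified here. For badly conditioned problems a QR-based solver would be preferable to the normal equations.

```r
library(Matrix)

set.seed(1)
n <- 2000
# Hypothetical simulated data: one numeric predictor and a 50-level factor
d <- data.frame(x = rnorm(n), fe = factor(sample(1:50, n, replace = TRUE)))
d$y <- 2 * d$x + rnorm(50)[as.integer(d$fe)] + rnorm(n)

# Sparse design matrix: intercept, x, and treatment-contrast dummies for fe
X <- sparse.model.matrix(~ x + fe, data = d)

# Solve (X'X) b = X'y; X itself is never converted to a dense matrix
XtX <- crossprod(X)        # small p x p matrix
Xty <- crossprod(X, d$y)   # p x 1
beta <- drop(as.matrix(solve(XtX, Xty)))
names(beta) <- colnames(X)

# Sanity check against lm(), feasible here because the data are small
fit_lm <- lm(y ~ x + fe, data = d)
all.equal(unname(beta), unname(coef(fit_lm)), tolerance = 1e-6)
```

Only the p x p cross-product is ever handled as a full system, so the memory cost is driven by the number of columns rather than by rows times columns.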

In contrast to @RuiBarradas's estimate of 3.5Gb for a dense model matrix:

m <- Matrix::sparse.model.matrix(~x,
     data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"

If you are using the rlm function from the MASS package, something like this should work:

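library(Matrix)
library(MASS)
## build the design matrix in sparse form
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
## use rlm's matrix interface (x, y) instead of the formula interface
rlm(y=pd$y, x=mm, ...)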

Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm() does any internal computations that would break and/or make the model matrix non-sparse.
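If rlm() does turn out to densify the matrix internally, one possible fallback is to hand-roll the robust fit so that the design matrix stays sparse throughout. The sketch below is an assumption-laden illustration, not MASS's implementation: `huber_irls_sparse` is a hypothetical helper, it does Huber M-estimation by iteratively reweighted least squares with a MAD-type scale, it uses the conventional Huber tuning constant k = 1.345, it returns coefficients only (no standard errors), and the data are simulated.

```r
library(Matrix)

# Hypothetical helper (not part of MASS): Huber M-estimation via IRLS,
# keeping the design matrix X sparse at every step.
huber_irls_sparse <- function(X, y, k = 1.345, maxit = 50, tol = 1e-8) {
  # Weighted least squares: solve (X'WX) b = X'Wy for weights w
  wls <- function(w) {
    WX <- Diagonal(x = w) %*% X   # row-scaled copy of X, still sparse
    drop(as.matrix(solve(crossprod(X, WX), crossprod(WX, y))))
  }
  beta <- wls(rep(1, nrow(X)))    # start from ordinary least squares
  for (i in seq_len(maxit)) {
    r <- y - drop(as.matrix(X %*% beta))   # current residuals
    s <- median(abs(r)) / 0.6745           # MAD-type scale estimate
    w <- pmin(1, k * s / abs(r))           # Huber weights in (0, 1]
    beta_new <- wls(w)
    done <- max(abs(beta_new - beta)) < tol * (1 + max(abs(beta)))
    beta <- beta_new
    if (done) break
  }
  setNames(beta, colnames(X))
}

# Simulated example with a 100-level factor and some gross outliers
set.seed(42)
n <- 5000
d <- data.frame(x = rnorm(n), fe = factor(sample(1:100, n, replace = TRUE)))
d$y <- 1.5 * d$x + rnorm(100)[as.integer(d$fe)] + rnorm(n)
d$y[1:50] <- d$y[1:50] + 30   # contaminate 1% of the responses

X <- sparse.model.matrix(~ x + fe, data = d)
fit <- huber_irls_sparse(X, d$y)
fit["x"]   # slope estimate; the simulated true value is 1.5
```

Each iteration only forms the p x p weighted cross-product, so peak memory should stay close to the size of the sparse model matrix itself. Results will not match rlm() exactly because the convergence criteria and scale handling differ.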

Ben Bolker
  • Thanks! But to run rlm(y~x+ factor(fe), data=pd), what should I do exactly? How do I write your above command for it? – D_B Feb 23 '20 at 21:45
  • I've answered this. Note that this is the kind of information that's really useful to include in your question in the first place ... – Ben Bolker Feb 24 '20 at 00:42
  • No! I got the same error, this time 5.9 Gb. I simplified my problem above: I have two sets of fixed effects, one with 966 unique values and the other with 170 unique values. Is that why I get the error? – D_B Feb 25 '20 at 03:09
  • Are both factors? That is, is your model `y~ factor(x1)+factor(x2)`? As I said, it's possible that `MASS::rlm` does some internal computation at some point that coerces the model matrix to dense ... – Ben Bolker Feb 25 '20 at 09:13