The set of dummy variables constructed from a factor has very low information content. For example, considering only the columns of your model matrix corresponding to your 966-level categorical predictor, each row contains exactly one 1 and 965 zeros.
Thus you can generally save a lot of memory by constructing a sparse model matrix using Matrix::sparse.model.matrix()
(or MatrixModels::model.Matrix(*, sparse=TRUE)
as suggested by the sparse.model.matrix
documentation). However, to use this it's necessary for whatever regression machinery you're using to accept a model matrix + response vector rather than requiring a formula (for example, to do linear regression you would need sparse.model.matrix
+ lm.fit
rather than being able to use lm
).
In contrast to @RuiBarradas's estimate of 3.5Gb for a dense model matrix:
m <- Matrix::sparse.model.matrix(~x,
data=data.frame(x=factor(sample(1:966,size=9e5,replace=TRUE))))
format(object.size(m),"Mb")
## [1] "75.6 Mb"
If you are using the rlm
function from the MASS
package, something like this should work:
library(Matrix)
library(MASS)
mm <- sparse.model.matrix(~x + factor(fe), data=pd)
rlm(y=pd$y, x=mm, ...)
Note that I haven't actually tested this (you didn't give a reproducible example); this should at least get you past the step of creating the model matrix, but I don't know if rlm()
does any internal computations that would break and/or make the model matrix non-sparse.