R: Fast way to create a sparse model matrix

Question

I am trying to create a model matrix with a formula that has many interaction terms (some continuous, some 0-1, some factors with many levels). The creation of this model matrix is the bottleneck of my script. In the end the model matrix is 8M rows with 1000 columns. Since the factors with many levels are 0-1 encoded the resulting matrix representing interactions is very sparse, so I already use sparse.model.matrix.

Is there a faster way to generate this matrix? Perhaps in Rcpp?

Maybe profile `sparse.model.matrix` to see where the bottlenecks are? — Ben Bolker, Oct 04 '15 at 23:20
It would be nice if you'd provide an MWE too, so we could get a better idea of what you are dealing with. — David Arenburg, Oct 06 '15 at 13:28
For further comparison see: http://stackoverflow.com/questions/31373710/r-fast-way-to-create-a-sparse-model-matrix — Love-R, May 12 '16 at 11:44
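
One way to act on the profiling suggestion above is base R's Rprof. This is only a sketch under stated assumptions: the data frame d, its columns x, f and g, and the formula are made up for illustration and are not the asker's data. The call is repeated in a loop so that the sampling profiler has enough time to collect samples.

```r
library(Matrix)

set.seed(1)
# Hypothetical stand-in data: one continuous column and two factors
d <- data.frame(
  x = rnorm(1e4),
  f = factor(sample(letters, 1e4, TRUE)),
  g = factor(sample(LETTERS, 1e4, TRUE))
)

prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)   # start the sampling profiler
for (i in 1:30) {
  mm <- sparse.model.matrix(~ x * f + f:g, d)
}
Rprof(NULL)                         # stop profiling

# Functions ranked by total time spent in them (including callees)
top <- head(summaryRprof(prof_file)$by.total)
print(top)
```

The by.total table should show which internal steps dominate on your own formula, which is more informative than guessing before rewriting anything in Rcpp.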

score 5 · Answer 1 · edited May 23 '17 at 10:28

Have you considered using caret's dummyVars? It works for me and seems reasonably fast.

?dummyVars compares the default behavior of model.matrix and dummyVars, but doesn't say much about it.

For a small performance benchmark on a reproducible example:

n = 1e3 # observations
m = 1e2 # variables
some_levels <- sort(c(LETTERS, letters))
library('microbenchmark')
set.seed(1234)

df <- data.frame(
  lapply(1:m, function(x) {
    switch(sample.int(3, 1),
           # "some continuous, some 0-1"
           '1' = rnorm(n), '2' = rbinom(n, 1, 0.5),
           # "some factors with many levels"
           '3' = factor(sample(some_levels, n, TRUE),
                        levels = some_levels)
    )
  })
)
names(df) <- paste0('V',1:m)

#------------- it sounds like you are doing something like this --------------
frm <- as.formula( paste('~', paste(names(df), collapse='+') ) )
library('Matrix')
microbenchmark(
  mm <- sparse.model.matrix(frm, df)
) # mean = .133 sec (YMMV)

#---------------- you could try something like this --------------------------
library('caret')
microbenchmark(
  mm2 <- dummyVars(frm, df, fullRank=TRUE)
) # mean = .00954 sec (YMMV)

Note that I set fullRank = TRUE so that "factors are encoded to be consistent with model.matrix and the resulting there [sic] are no linear dependencies induced between the columns," per ?dummyVars. You might want to remove fullRank = TRUE to induce the behavior of sparse=TRUE in contr.ltfr, as in sparse.model.matrix. I could not find clear documentation on this.
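
To make the effect of fullRank concrete, here is a small sketch on a toy factor. The data frame toy and its column f are made up for illustration. predict() is what turns the dummyVars object into an actual matrix.

```r
library(caret)

# Hypothetical toy data: a single factor with three levels
toy <- data.frame(f = factor(c("a", "b", "c", "a"), levels = c("a", "b", "c")))

# fullRank = TRUE drops one level per factor, as model.matrix does
full <- predict(dummyVars(~ f, data = toy, fullRank = TRUE), newdata = toy)

# The default (fullRank = FALSE) keeps one indicator column per level
ltfr <- predict(dummyVars(~ f, data = toy), newdata = toy)

colnames(full)
colnames(ltfr)
```

With fullRank = TRUE the three-level factor yields two columns; without it, three. Which one you want depends on whether the downstream model needs a full-rank design.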

Doesn't `dummyVars` just create a map? Don't you need a predict statement too? Like `mm3 <- predict(mm2, df)`? — screechOwl, Oct 26 '17 at 21:10
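
As the comment above points out, dummyVars() only builds the encoding object; the matrix itself comes from predict(). A minimal sketch of the full path, on a made-up data frame (toy, x, b and f are illustrative names), with an optional conversion to a sparse matrix via the Matrix package:

```r
library(caret)
library(Matrix)

set.seed(1)
# Hypothetical toy data: continuous, 0-1, and factor columns
toy <- data.frame(
  x = rnorm(6),
  b = rbinom(6, 1, 0.5),
  f = factor(sample(c("a", "b", "c"), 6, TRUE), levels = c("a", "b", "c"))
)

enc   <- dummyVars(~ x + b + f, data = toy, fullRank = TRUE)  # encoding map only
dense <- predict(enc, newdata = toy)                          # dense numeric matrix
sp    <- Matrix(dense, sparse = TRUE)                         # sparse representation

dim(dense)
class(sp)
```

Note that this route builds a dense matrix first, so a fair benchmark against sparse.model.matrix should time the predict() step as well, and at 8M rows by 1000 columns the dense intermediate may not fit in memory.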
