
I work on sparse data problems that generally require computing a sparse model matrix. The resulting matrix contains ~95% zeros. The sparsity usually comes from factors that get one-hot encoded, and from interactions taken with those sparse columns.

library(Matrix)
library(data.table)
library(magrittr)

n = 500000
p = 10
x.matrix = matrix(rnorm(n*p), n, p)
colnames(x.matrix) = sprintf("n%s", 1:p)

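x.categorical = data.table(
  c1 = sample(LETTERS, n, replace = TRUE),
  c2 = sample(LETTERS, n, replace = TRUE),
  c3 = sample(LETTERS, n, replace = TRUE)
)

# cBind() is deprecated/defunct in recent Matrix versions; combine into one data.table instead
x = cbind(as.data.table(x.matrix), x.categorical)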

myformula = "~ n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8 + n9 +n10 +
  c1 + c1*c2 + c3 + n1:c1"
mm = model.matrix(myformula %>% as.formula, x)
mm2 = sparse.model.matrix(myformula %>% as.formula, x)

I have found that sparse.model.matrix performs worse on a sparse problem than model.matrix (which is normally used for dense problems). This shows up in RStudio's profiling tools.

[Image: sparse.model.matrix poor performance]

Here sparse.model.matrix takes much more time than model.matrix and uses almost the same amount of memory. On some problems I have found sparse.model.matrix to be up to 10x slower than model.matrix, even though the data should be sparse.
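For anyone who wants to reproduce the comparison without the RStudio profiler, here is a minimal sketch using system.time and object.size. It assumes a much smaller n (the name n_small is mine) and a reduced formula so that it runs quickly. Note that object.size only measures the finished matrices, not the peak memory allocated while they are being built, so it is not a substitute for the profiler's memory numbers.

```r
library(Matrix)

set.seed(1)
n_small <- 5000  # assumption: far smaller than n above, just to keep the sketch fast
d <- data.frame(
  n1 = rnorm(n_small),
  c1 = factor(sample(LETTERS, n_small, replace = TRUE)),
  c2 = factor(sample(LETTERS, n_small, replace = TRUE))
)
f <- ~ n1 + c1 * c2 + n1:c1  # reduced version of the formula in the question

# Time both constructors on the same data
t_dense  <- system.time(mm_dense  <- model.matrix(f, d))
t_sparse <- system.time(mm_sparse <- sparse.model.matrix(f, d))
print(c(dense = t_dense[["elapsed"]], sparse = t_sparse[["elapsed"]]))

# Size of the results in bytes (not peak memory during construction)
print(c(dense  = as.numeric(object.size(mm_dense)),
        sparse = as.numeric(object.size(mm_sparse))))

# Fraction of entries in the sparse result that are zero
zero_fraction <- 1 - nnzero(mm_sparse) / prod(dim(mm_sparse))
print(zero_fraction)
```

Timings will vary by machine and Matrix version, so treat the output as indicative only.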

Are there better ways to create the sparse matrix? I have searched quite a lot and have not found any. Alternatively, I would be interested in other approaches, or in tips on how to implement a smarter version of sparse.model.matrix from scratch, perhaps using Rcpp or data.table functions.
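To make the "from scratch" idea concrete, here is a rough sketch of building the indicator columns directly with Matrix::sparseMatrix, so no dense intermediate is ever created. The helper name sparse_dummies and its values argument are hypothetical, not part of any package. It also uses full dummy coding (one column per level), which differs from the default treatment contrasts of model.matrix, and it does not parse formulas at all.

```r
library(Matrix)

# Hypothetical helper: one column per factor level, filled with `values`
# (1 for plain indicators, a numeric vector for a numeric:factor interaction).
# Unlike model.matrix's default contrasts, no reference level is dropped.
sparse_dummies <- function(f, values = 1) {
  f <- as.factor(f)
  keep <- !is.na(f)                      # rows with NA get an all-zero row
  values <- rep_len(values, length(f))
  sparseMatrix(
    i = which(keep),
    j = as.integer(f)[keep],
    x = values[keep],
    dims = c(length(f), nlevels(f)),
    dimnames = list(NULL, levels(f))
  )
}

set.seed(1)
n_demo <- 1000
c1 <- factor(sample(LETTERS, n_demo, replace = TRUE), levels = LETTERS)
c2 <- factor(sample(LETTERS, n_demo, replace = TRUE), levels = LETTERS)
n1 <- rnorm(n_demo)

main_c1   <- sparse_dummies(c1)                # c1 main effect
int_n1_c1 <- sparse_dummies(c1, values = n1)   # n1:c1 interaction
int_c1_c2 <- sparse_dummies(interaction(c1, c2, drop = TRUE, sep = ":"))  # c1:c2

colnames(main_c1)   <- paste0("c1", colnames(main_c1))
colnames(int_n1_c1) <- paste0("n1:c1", colnames(int_n1_c1))

X <- cbind(main_c1, int_n1_c1)  # stays sparse
print(dim(X))
```

Dropping the first column of each main-effect block would roughly mimic treatment contrasts, but matching sparse.model.matrix column for column across arbitrary formulas would take more work than this sketch attempts.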

The source of the problem is in sparse2int, although I don't quite understand what it is for, and there are a few "FIXME"s still left in the code.

nrussell
JCWong
maybe flodel's answer could be helpful: https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r – user20650 Aug 21 '17 at 23:01
@user20650 Awesome, that is a great answer! It leaves out interactions though, and lacks the expressiveness that formulas give you. Admittedly the OP there could have used sparse.model.matrix and passed in the contrasts; it would just be slow for the reasons given in my post above. – JCWong Aug 22 '17 at 00:11
