Suppose you have a data frame with a large number of columns (1000 factors, each with 15 levels). You'd like to create a dummy-variable data set, but since it would be very sparse, you would like to keep the dummies in sparse matrix format.

My data set is quite big, so the fewer steps there are, the better. I know how to do the steps above separately, but I couldn't figure out how to create the sparse matrix directly from the initial data set, i.e. in one step instead of two. Any ideas?

EDIT: Some comments asked for further elaboration, so here it goes:

Where X is my original data set with 1000 columns and 50000 records, each column having 15 levels,

Step 1: Creating dummy variables from the original data set, with code like:

# Number of levels per column (15 in my data)
l <- 15
# Creating dummy data set with empty values
dummified <- matrix(NA, nrow(X), l * ncol(X))
# Adding values to this data set for each column and each level within columns
for (i in 1:ncol(X)) {
  colFactr <- factor(X[, i], exclude = NULL)
  for (j in 1:l) {
    lvl <- levels(colFactr)[j]
    indx <- ((i - 1) * l) + j
    dummified[, indx] <- ifelse(colFactr == lvl, 1, 0)
  }
}

Step 2: Converting that huge matrix into a sparse matrix, with code like:

library(Matrix)
sparse.dummified <- Matrix(dummified, sparse = TRUE)

But this approach still creates the large intermediate matrix, which takes a lot of time and memory, so I am asking whether there is a direct method.

Matthew Lundberg
agondiken
  • Maybe it's just me but I find it quite hard to understand what you are asking for? Can you elaborate a bit or give a small example? Maybe show us what your "two steps" were? – flodel Apr 12 '14 at 22:05
  • Not sure what you are asking...but you can create a matrix of dummy variables in one step `model.matrix(~ -1 + . , data=yourdata)`. Is this what you want? – user20650 Apr 12 '14 at 22:53
  • @flodel: I edited the original question. Hope it's more elaborated. – agondiken Apr 13 '14 at 09:56
  • @user20650 : Your code also creates the dummy matrix but I want it in sparse matrix format directly. – agondiken Apr 13 '14 at 10:13
  • @user20650 : Btw, as far as I understand, model.matrix removes one of the levels in each column, so for example instead of 1000*15 dummy columns you end up with 1000*14 columns. I am assuming this is because the combination of other 14 already gives that 15th column's information, but I still prefer having all 15 there, is there a way around this? – agondiken Apr 13 '14 at 10:13
  • haven't looked carefully at this, but how about `Matrix::sparse.model.matrix` ?? – Ben Bolker Apr 13 '14 at 12:30
  • I saw that someone was referred to this answer today, nearly 8 years later. An easier method now (not sure if it existed when this question was asked!)-- Use the package `sparseMatrixStats`. You install this with `BiocManager::install("sparseMatrixStats")`. Then you only have two arguments, the data frame or matrix you're converting and the type of matrix - `sparse_mat <- as(denseMatrixObject, "dgCMatrix")`. It took less than a second for a 3136 column, 60,000 row object – Kat Dec 03 '21 at 01:18

2 Answers

Thanks for clarifying your question. Try this.

Here is sample data with two columns that have three and two levels respectively:

set.seed(123)
n <- 6
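# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default (as.integer() below relies on factor codes)
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE),
                 stringsAsFactors = TRUE)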
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D

library(Matrix)
spm <- lapply(df, function(j)sparseMatrix(i = seq_along(j),
                                          j = as.integer(j), x = 1))
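# Originally written with cBind(), which has since been deprecated in Matrix;
# base cbind() handles sparse matrices in R >= 3.2.0
do.call(cbind, spm)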
# 6 x 5 sparse Matrix of class "dgCMatrix"
#               
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .

Edit: @user20650 pointed out that do.call(cBind, ...) was sluggish or failed with large data. So here is a more complex but much faster and more efficient approach:

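n <- nrow(df)
nlevels <- sapply(df, nlevels)            # number of levels in each column
i <- rep(seq_len(n), ncol(df))            # row index, repeated once per column
# column index: level code plus the total number of levels in preceding columns
j <- unlist(lapply(df, as.integer)) +
     rep(cumsum(c(0, head(nlevels, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x)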
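
The index arithmetic above can be wrapped in a small function. This is only a sketch: the name `sparse_dummies` is hypothetical, the `dims` and `dimnames` arguments are additions that were not part of the code above, and it assumes every column is a factor with no missing values.

```r
library(Matrix)

# Hypothetical helper (illustration only): one indicator column per factor
# level, built directly in sparse form without a dense intermediate matrix.
sparse_dummies <- function(df) {
  stopifnot(all(vapply(df, is.factor, logical(1))))
  n      <- nrow(df)
  nlev   <- vapply(df, nlevels, integer(1))
  # Number of dummy columns that precede each variable's block
  offset <- cumsum(c(0L, head(nlev, -1)))
  i <- rep(seq_len(n), times = ncol(df))
  j <- unlist(lapply(df, as.integer), use.names = FALSE) +
       rep(offset, each = n)
  # Column labels of the form <variable><level>, e.g. "xA"
  col_names <- unlist(Map(function(nm, f) paste0(nm, levels(f)),
                          names(df), df), use.names = FALSE)
  sparseMatrix(i = i, j = j, x = 1,
               dims = c(n, sum(nlev)),
               dimnames = list(NULL, col_names))
}

df <- data.frame(x = factor(c("A", "C", "B", "C")),
                 y = factor(c("E", "E", "D", "E")))
m <- sparse_dummies(df)
dim(m)
```

Passing `dims` explicitly keeps a column for every declared level even when the last level never occurs in the data, which the bare `sparseMatrix(i, j, x)` call would otherwise drop.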
flodel
  • This is really fast! I liked Ben's method also but this last version you produced with user's point is a bit faster. Thanks a lot for the help! – agondiken Apr 14 '14 at 11:50
  • Hi, flodel, great answer. If do.call does not work well, how about `Reduce(cBind, spm)`? – Ping Jin Aug 17 '16 at 04:46

This can be done slightly more compactly with Matrix::sparse.model.matrix, although the requirement to have all columns for all variables makes things a little more difficult.

Generate input:

set.seed(123)
n <- 6
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE))

If you didn't need all columns for all variables you could just do:

library(Matrix)
sparse.model.matrix(~.-1,data=df)
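
To see why this form does not give all columns, here is a small sketch on made-up data (the data frame below is an assumption for illustration, not the sample data above). With the intercept removed, the first factor keeps all of its levels, but every later factor still loses its reference level.

```r
library(Matrix)

# Illustrative data: x has 3 levels, y has 2 levels
df <- data.frame(x = factor(c("A", "C", "B", "C")),
                 y = factor(c("E", "E", "D", "E")))

m <- sparse.model.matrix(~ . - 1, data = df)
# 3 columns for x plus only 1 for y: 4 columns rather than 5
ncol(m)
```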

If you need all columns:

fList <- lapply(names(df),reformulate,intercept=FALSE)
mList <- lapply(fList,sparse.model.matrix,data=df)
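# Originally written with cBind(), which has since been deprecated in Matrix;
# base cbind() handles sparse matrices in R >= 3.2.0
do.call(cbind, mList)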
Ben Bolker
  • @ Ben; I was interested in looking at the performance of this (and flodel's) for a problem scaled to OP's data. If df is defined as `df <- data.frame(replicate(1000,sample(letters[1:15], 100, TRUE)))`, neither of your solutions runs: `do.call` fails with `Error in match.fun(FUN) : node stack overflow`. Is this a general issue or just my PC? Thanks – user20650 Apr 13 '14 at 15:28
  • on my laptop creating `fList` is nearly instantaneous; creating `mList` takes about 6 seconds; and then I get the node stack overflow. However, if I do `system.time(X <- Reduce(cBind,mList))` it takes 3 seconds and the result is a 100 x 14985 sparse matrix. Perhaps worth contacting the `Matrix` maintainer ... – Ben Bolker Apr 13 '14 at 16:01
  • @ Ben; thanks for confirming - so it's an issue with how `cBind` is working when called in `do.call` when large? – user20650 Apr 13 '14 at 16:13
  • Think so. (You could easily test/explore this by using `cBind` on increasingly larger subsets of the sparse submatrices and see where it breaks ...) – Ben Bolker Apr 13 '14 at 16:49
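
The `Reduce` workaround mentioned in the comments can be sketched as follows. It is a minimal illustration on made-up data, and it uses base `cbind` in place of `cBind`, on the assumption of R >= 3.2.0 with a reasonably recent Matrix package.

```r
library(Matrix)

# Illustrative data: x has 3 levels, y has 2 levels
df <- data.frame(x = factor(c("A", "C", "B", "C")),
                 y = factor(c("E", "E", "D", "E")))

# One formula per column, each without an intercept, so no level is dropped
fList <- lapply(names(df), reformulate, intercept = FALSE)
mList <- lapply(fList, sparse.model.matrix, data = df)

# Bind the blocks pairwise instead of passing them all to one do.call()
full <- Reduce(cbind, mList)
dim(full)
```

Binding pairwise keeps each call to two arguments, which is what avoided the node stack overflow reported above, at the cost of copying the growing result at every step.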