1

I'm trying to use a sparse matrix to generate dummy variables for a set of data with 5.8 million rows and two categorical columns.

The structure of the data is:

mydata: data.table of 5,800,000 rows and two categorical (in integer format) variables Var1 and Var2

nlevel(Var1) : 210,000 (levels include all numbers between 1 and 210,000)

nlevel(Var2) : 500 (levels include all numbers between 1 and 500)

here's an example of mydata:

 Var_1      Var_2
   1          4
   1          2
   2          7
   5          9
   5          500
   .
   .
   .

  200         6
  200         2
  200         80
   .
   .
   .

I'm using a sparse Matrix (sparse_Mx) to create the dummy variable matrix which would be of the form:

Var_1       Var_2_level_1     Var_2_level_2   . . .    Var_2_level_500
  1                0                   1                    0
  2                0                   0                    0
  3                1                   1                    0
  4                0                   0                    0
  5                0                   0                    1

  .
  .
  .

 200              0                    1                    0
  .
  .
  .

210,000           ...                 ...                  ...

I didn't know how to do this efficiently, so i used a for-loop to create the dummy variable matrix:

library(Matrix) #for sparse matrices
m2 <- Matrix(0, nrow = 210000, ncol = 500 , sparse = TRUE) 

for (i in 1: nrow(mydata))
  sparse_Mx[ mydata[i, Var_1] , mydata[i, Var_2] ] <- 1

It basically goes through each row of mydata, and based on the row Var1 value (which determines the row in the matrix) and the row Var2 value (which determines the column number in the matrix, fills the sparse matrix with 1.

It works, except it's taking forever (as the for-loop has to go through 5,800,000 loops!)

Is there any way to do this more efficiently? I really dislike using for-loop for this purpose but couldn't think of another way to do this.


Edit: I'd like to add that I have tried using sparse.model.matrix(), to no avail. the generated matrix is not in the right format (210,000 rows and 500 columns).

The variables were converted to factors and used the following:

sp_mx <- sparse.model.matrix( ~ . -1 , data = mydata)

However, I get a sparse matrix of [5,800,000 x 500 ] as opposed to a matrix of [210,000 x 500]

I've tried many variations and still same result:

sp_mx <- sparse.model.matrix( ~ Var2 -1 , data = mydata)

or

 sp_mx <- sparse.model.matrix(Var1 ~ Var2 -1 , data = mydata)

all of them result in a sparse matrix with all rows. what i need is a [210,000 x 500] matrix that will have more than one 1 in each row.

Ankhnesmerira
  • 1,386
  • 15
  • 29
  • If I understood it correctly, the columns in mydata (i.e. Var_1 and Var_2) represent the locations where you want value 1 in the sparse matrix? – tushaR Jun 22 '17 at 08:40
  • yes, that's correct. For example if Var1 = 3 and Var2 = 400 then the matrix should have a value of 1 in its [3,400] location. The values of Var1 range from 1 to 210,000 (number of levels) and the values of Var2 range from 1 to 500 (again, number of levels) so, the matrix should have dimension of [210,000 , 500] – Ankhnesmerira Jun 22 '17 at 09:04
  • I've tried sparse.model.matrix (model.matrix doesn't even work and gives me too large size error) but the resultant matrix is [5.8M x 500] – Ankhnesmerira Jun 22 '17 at 09:07

2 Answers2

2

Try this:

spmat<-Matrix(0,nrow = 210000 ,ncol = 500,sparse = T)
locs<-Matrix(data=c(mydata$Var_1,mydata$Var_2),byrow=F,ncol=2)
spmat[locs]=1
tushaR
  • 3,083
  • 1
  • 20
  • 33
0

Why do you want a sparse matrix? For a dummy matrix you can also just use:

model.matrix(~ . + 0, data = df)

The 0 indicates no intercept and the . indicates that all categorical variables will be transformed. Be sure to set these variables as factors using as.factor() beforehand.

Lukas Heeren
  • 149
  • 1
  • 10
  • I'm using sparse matrix because the matrix variable gets large(size-wise). also, model.matrix doesn't work with that format. it doesn't give me a matrix of 200,000 rows and 500 columns. (rows of Var1 and columns of levels of Var2) – Ankhnesmerira Jun 22 '17 at 07:32
  • Ah, got it. Then you can try: sparse.model.matrix( ~. -1, data = df) – Lukas Heeren Jun 22 '17 at 07:37
  • That doesn't work either. The matrix is not in the right format and instead it creates a matrix of over 11 million rows with 2 columns. – Ankhnesmerira Jun 22 '17 at 07:41
  • i had already used model.matrix and sparse.model.matrix but i don't get the right [210,000 x 500] matrix – Ankhnesmerira Jun 22 '17 at 08:03
  • I did. I converted both variables to factors and used the model.matrix which gave me this error: Error: cannot allocate vector of size 316.8 Gb then i used sparsed.model.matrix and it gave me a wacko matrix of some ridiculous dimensions – Ankhnesmerira Jun 22 '17 at 08:18