I am trying to create a market basket matrix from data that looks like the following:

input <- matrix( c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003,100001,100002,100003,100004,100005,100006,100002,100003,100007,100002,100003,100008), ncol=2)

This represents the following data:

colnames(input) <- c( "Customer" , "Product" )

From this, a matrix is created with one row per customer and one column per product. This can be achieved by first creating the matrix filled with zeros:

input <- as.data.frame(input)
m <- matrix(0, length(unique(input$Customer)), length(unique(input$Product)))
rownames(m) <- unique(input$Customer)
colnames(m) <- unique(input$Product)

This is all fast enough (my data has 750,000+ rows, which yields a 15,000 by 1,500 matrix), but now I want to fill the matrix where appropriate:

for( i in 1:nrow(input) ) {
    m[ as.character(input[i,1]),as.character(input[i,2])] <- 1
}

I think there has to be a more efficient way to do this, as I learned from Stack Overflow that for loops can often be avoided. So the question is: is there a faster way?

I need the data in a matrix because I would like to use packages like caret. After that I will probably run into the same problem as in "R memory management advice (caret, model matrices, data frames)", but that's a concern for later.

Freddy

3 Answers

You don't really need reshape2 for this; table is what you are looking for.

m1 <- as.matrix(as.data.frame.matrix(table(input)))

all.equal(m, m1)
[1] TRUE
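
Two caveats may apply to real data: table() orders rows and columns by sorted value (the toy data happens to be sorted already), and it counts occurrences, so a repeated Customer/Product pair would give a value above 1. The following is a minimal sketch, not part of the original answer, that fills the pre-allocated matrix from the question in a single step with a two-column index matrix, and then clamps the table() result to 0/1 for comparison. The names customers, products and idx are introduced here for illustration only.

```r
# Toy data from the question, built directly as a data frame
input <- data.frame(
  Customer = c(rep(1000001, 6), rep(1000002, 3), rep(1000003, 3)),
  Product  = c(100001:100006, 100002, 100003, 100007, 100002, 100003, 100008)
)

# Pre-allocate the zero matrix as in the question (order of first appearance)
customers <- unique(input$Customer)
products  <- unique(input$Product)
m <- matrix(0, length(customers), length(products),
            dimnames = list(as.character(customers), as.character(products)))

# Each row of idx is one (row, column) position; a single assignment
# replaces the whole for loop
idx <- cbind(match(input$Customer, customers),
             match(input$Product, products))
m[idx] <- 1

# table() counts occurrences, so clamp to 0/1 in case a pair repeats
m1 <- as.matrix(as.data.frame.matrix(table(input)))
m1[m1 > 0] <- 1

all.equal(m, m1)
```

On data where the order of first appearance differs from the sorted order, m and m1 would hold the same entries but with rows or columns permuted, so the comparison would need reordering first.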
shadow

The reshape2 package has a casting function that'll do the job:

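library(reshape2)
# Every (Customer, Product) pair that occurs maps to 1; absent pairs are filled with 0
m <- acast(input, Customer ~ Product, fun.aggregate = function(x) 1, fill = 0)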
m

gives me

        100001 100002 100003 100004 100005 100006 100007 100008
1000001      1      1      1      1      1      1      0      0
1000002      0      1      1      0      0      0      1      0
1000003      0      1      1      0      0      0      0      1

I hope this is what you were looking for?

You can use a sparse matrix:

library(Matrix)
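# Factors are needed so that their integer codes can serve as row/column indices;
# stringsAsFactors = TRUE must be set explicitly in R >= 4.0, where the default is FALSE
input <- as.data.frame(apply(input, 2, as.character), stringsAsFactors = TRUE)
m <- sparseMatrix(
  i = as.integer(input[, 1]),
  j = as.integer(input[, 2]),
  x = 1,
  dims = c(nlevels(input[, 1]), nlevels(input[, 2])),
  dimnames = list(levels(input[, 1]), levels(input[, 2]))
)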
)
m
# 3 x 8 sparse Matrix of class "dgCMatrix"
#         100001 100002 100003 100004 100005 100006 100007 100008
# 1000001      1      1      1      1      1      1      .      .
# 1000002      .      1      1      .      .      .      1      .
# 1000003      .      1      1      .      .      .      .      1
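
If a downstream package turns out to need an ordinary matrix, the sparse result can be converted with as.matrix. The sketch below is an illustration under assumptions rather than part of the original answer: it rebuilds the toy data as a data frame, drops duplicate Customer/Product rows first (sparseMatrix adds up the x values of repeated index pairs, which would otherwise produce entries above 1), and compares the memory footprint of both forms. The names rows, cols, m_sparse and m_dense are hypothetical.

```r
library(Matrix)

# Toy data from the question; unique() removes repeated pairs so that
# every stored entry stays at 1
input <- unique(data.frame(
  Customer = c(rep(1000001, 6), rep(1000002, 3), rep(1000003, 3)),
  Product  = c(100001:100006, 100002, 100003, 100007, 100002, 100003, 100008)
))

# Factor codes give consecutive 1-based indices for rows and columns
rows <- factor(input$Customer)
cols <- factor(input$Product)

m_sparse <- sparseMatrix(
  i = as.integer(rows),
  j = as.integer(cols),
  x = 1,
  dims = c(nlevels(rows), nlevels(cols)),
  dimnames = list(levels(rows), levels(cols))
)

# Dense copy for code that only accepts base matrices
m_dense <- as.matrix(m_sparse)

# Compare memory use; the sparse form only pays off when most entries are zero
print(object.size(m_sparse))
print(object.size(m_dense))
```

On a matrix as small as this one the sparse object is likely to be the larger of the two because of its fixed overhead; the saving should only show up at sizes like the 15,000 by 1,500 case described in the question.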
Vincent Zoonekynd
  • This matrix will use less memory, I presume, but can you work with it in combination with packages like caret and party? (I thought that wouldn't work.) – Freddy Jul 18 '13 at 13:09
  • You should try, but I think it should work: all the operations are overloaded. What can happen, though, is that those packages use the matrix to build other matrices, and these could be normal (dense) matrices. If it does not work, you can always transform a sparse matrix to a dense one with `as.matrix`. – Vincent Zoonekynd Jul 18 '13 at 13:20
  • Thank you, I will definitely look into it. Because my matrix has a lot of 0's, it will save a lot of memory. If I had more reputation I would upvote your comment ;) – Freddy Jul 18 '13 at 13:28