I have a very large CSV file of similarities between keywords (about 91 million rows, so a for loop in R takes too long). Read into a data.frame, it looks like:

> df   
kwd1 kwd2 similarity  
a  b  1  
b  a  1  
c  a  2  
a  c  2 

It is a sparse list and I would like to convert it into a sparse matrix:

> myMatrix 
  a b c  
a . 1 2
b 1 . .
c 2 . .

I tried using sparseMatrix(), but converting the keyword names to integer indexes takes too much time.

Thanks for any help!

rfoley

1 Answer

acast from the reshape2 package will do this nicely. There are base R solutions but I find the syntax much more difficult.

library(reshape2)
df <- structure(list(kwd1 = structure(c(1L, 2L, 3L, 1L), .Label = c("a", 
"b", "c"), class = "factor"), kwd2 = structure(c(2L, 1L, 1L, 
3L), .Label = c("a", "b", "c"), class = "factor"), similarity = c(1L, 
1L, 2L, 2L)), .Names = c("kwd1", "kwd2", "similarity"), class = "data.frame", row.names = c(NA, 
-4L))

acast(df, kwd1 ~ kwd2, value.var='similarity', fill=0)

  a b c
a 0 1 2
b 1 0 0
c 2 0 0
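
As one possible base R route (a sketch, not necessarily what was meant by "base R solutions" above), xtabs accepts sparse = TRUE and then returns a sparse matrix from the Matrix package instead of a dense table. The data frame below is the same toy data, rebuilt so the block runs on its own.

```r
library(Matrix)  # xtabs(sparse = TRUE) returns a Matrix-package object

# Same toy data as above, rebuilt here so the block is self-contained.
df <- data.frame(
  kwd1 = factor(c("a", "b", "c", "a"), levels = c("a", "b", "c")),
  kwd2 = factor(c("b", "a", "a", "c"), levels = c("a", "b", "c")),
  similarity = c(1, 1, 2, 2)
)

# Cross-tabulate similarity by the two keyword columns, using sparse storage.
sp <- xtabs(similarity ~ kwd1 + kwd2, data = df, sparse = TRUE)
sp
```

Whether this scales to 91 million rows has not been checked here; it is only meant to show the call.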

Using sparseMatrix from the Matrix package, which keeps the result sparse:

library(Matrix)
df$kwd1 <- factor(df$kwd1)
df$kwd2 <- factor(df$kwd2)

foo <- sparseMatrix(as.integer(df$kwd1), as.integer(df$kwd2), x=df$similarity)

> foo
3 x 3 sparse Matrix of class "dgCMatrix"

[1,] . 1 2
[2,] 1 . .
[3,] 2 . .

foo <- sparseMatrix(as.integer(df$kwd1), as.integer(df$kwd2), x=df$similarity,
                    dimnames=list(levels(df$kwd1), levels(df$kwd2)))

> foo 

3 x 3 sparse Matrix of class "dgCMatrix"
  a b c
a . 1 2
b 1 . .
c 2 . .
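
One caveat with calling factor() on each column separately: if a keyword appears in only one of the two columns, the two sets of levels differ and the rows and columns no longer line up. A minimal sketch of one way around this, assuming a single shared vocabulary is wanted, uses match() against the union of both columns. The `edges` data and the names `keywords` and `shared` are made up for illustration.

```r
library(Matrix)

# Hypothetical edge list where "d" appears only in kwd2.
edges <- data.frame(
  kwd1 = c("a", "b", "c"),
  kwd2 = c("b", "a", "d"),
  similarity = c(1, 1, 3),
  stringsAsFactors = FALSE
)

# One shared, sorted vocabulary for both columns.
keywords <- sort(unique(c(edges$kwd1, edges$kwd2)))

# match() maps every keyword to its position in the vocabulary in a
# single vectorised call, so no loop over rows is needed.
i <- match(edges$kwd1, keywords)
j <- match(edges$kwd2, keywords)

# Square matrix with identical row and column names.
shared <- sparseMatrix(
  i = i, j = j, x = edges$similarity,
  dims = c(length(keywords), length(keywords)),
  dimnames = list(keywords, keywords)
)
shared
```

The same effect can be had with factor(x, levels = keywords) on both columns; match() just skips creating the intermediate factors.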
Justin
  • Hmm I will try this. However, will this give me a sparse matrix? Memory won't allow for a dense matrix with 0's. – rfoley Sep 11 '12 at 20:13
  • Maybe if I set drop to true it will be sparse. – rfoley Sep 11 '12 at 20:15
  • OK, this is like what I was doing with sparseMatrix before. However, I was having a problem with converting the keywords to integer indexes; I was using apply and which to do what as.integer is doing here. Hopefully this will be faster! – rfoley Sep 11 '12 at 20:51
  • Don't miss the `factor()` step! That is how I am forcing `as.integer` to work, and how the `dimnames` argument works too. Also, if I've answered your question, please mark it as such by clicking the checkmark. That way others know the question has been resolved. – Justin Sep 11 '12 at 20:54
  • I am now trying to convert this matrix to a dist object, but it reports an error that the 'problem is too large'. If you have any insight, I have posted another question: http://stackoverflow.com/questions/12379233/efficient-way-to-convert-csv-of-sparse-distances-to-dist-object-r Thanks for the help! If there is any way to rate your answer besides upvoting, I would be happy to do so. – rfoley Sep 11 '12 at 23:24