
I have a sparseMatrix (library Matrix) or a simple_triplet_matrix (library slam) of docs x terms, such as:

library(Matrix)
mat <- sparseMatrix(i = c(1,2,4,5,3), j = c(2,3,4,1,5), x = c(3,2,3,4,1))
rownames(mat) <- paste0("doc", 1:5)
colnames(mat) <- paste0("word", 1:5)

5 x 5 sparse Matrix of class "dgCMatrix"
     word1 word2 word3 word4 word5
doc1     .     3     .     .     .
doc2     .     .     2     .     .
doc3     .     .     .     .     1
doc4     .     .     .     3     .
doc5     4     .     .     .     .

or:

library(slam)
mat2 <- simple_triplet_matrix(i = c(1,2,4,5,3), j = c(2,3,4,1,5), v = c(3,2,3,4,1),
                          dimnames = list(paste0("doc", 1:5), paste0("word", 1:5)))

I want to turn either of these matrices into a tm DocumentTermMatrix without first creating a Corpus/VCorpus.

The approach in "In R tm package, build corpus FROM Document-Term-Matrix" works only for small matrices.

My matrix is quite big, ~16K x ~53K, so the list suggested there is too large to fit in a reasonable amount of RAM. Besides, I don't see why I should have to create a Corpus when the tm package manual explicitly says a DocumentTermMatrix is a sparse matrix.

Any suggestions on how to convert an already sparse matrix into tm's DocumentTermMatrix?

Thank you.

Giora Simchoni

1 Answer


The documentation is admittedly a little tricky here. On a simple_triplet_matrix you can use the coercion function as.DocumentTermMatrix, but not the constructor DocumentTermMatrix directly.

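library(slam)
library(tm)  # provides as.DocumentTermMatrix and weightTfIdf
# Build the docs x terms matrix in triplet (i, j, v) form
mat2 = simple_triplet_matrix(i = c(1,2,4,5,3), j = c(2,3,4,1,5), v = c(3,2,3,4,1),
                              dimnames = list(paste0("doc", 1:5), paste0("word", 1:5)))
# Coerce to a DocumentTermMatrix; weightTfIdf applies tf-idf weights to the counts
mat2 = as.DocumentTermMatrix(mat2, weighting = weightTfIdf)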

You can check:

> class(mat2)
[1] "DocumentTermMatrix"    "simple_triplet_matrix"
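
For the Matrix case from the question, one possible route is to pull the (i, j, x) triplets out of the sparseMatrix with summary() and hand them to simple_triplet_matrix, so that nothing is ever converted to a dense matrix. The following is only a minimal sketch: the names trip, stm and dtm are made up for illustration, and it assumes you want to keep the raw counts, which is why it uses weightTf instead of weightTfIdf.

```r
library(Matrix)
library(slam)
library(tm)

# Same toy docs x terms matrix as in the question
mat <- sparseMatrix(i = c(1, 2, 4, 5, 3), j = c(2, 3, 4, 1, 5), x = c(3, 2, 3, 4, 1),
                    dimnames = list(paste0("doc", 1:5), paste0("word", 1:5)))

# summary() on a sparse Matrix returns the non-zero entries as columns i, j, x
trip <- summary(mat)

# Rebuild the same matrix in slam's triplet format (no dense intermediate)
stm <- simple_triplet_matrix(i = trip$i, j = trip$j, v = trip$x,
                             nrow = nrow(mat), ncol = ncol(mat),
                             dimnames = dimnames(mat))

# Coerce to a DocumentTermMatrix; weightTf leaves the term counts unchanged
dtm <- as.DocumentTermMatrix(stm, weighting = weightTf)
```

You can then check the result with class(dtm) and inspect(dtm) before running this on the full ~16K x ~53K matrix.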
tchakravarty