
I've got a large Term Document Matrix (6 elements, 44.3 Mb).

I need to convert it into a matrix, but when I try I get the magical error message: "cannot allocate 100 GBs".

Is there any package/library that allows doing this transformation in chunks?

I've tried ff and bigmemory, but they do not seem to allow conversion from a TDM to a matrix.

Dario Federici
  • Maybe a silly question that you have already thought through, but what are the downstream operations you want to apply to the matrix? Maybe there are also ways to get around turning the whole DTM into a matrix? – Manuel Bickel Nov 21 '17 at 07:52
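
Picking up on that comment: if the downstream operations accept sparse matrices, one way to sidestep the dense conversion entirely is to map the TDM onto a sparse Matrix object. A minimal sketch, assuming tdm was built with tm (which stores it as a slam simple_triplet_matrix, so its nonzero entries live in tdm$i, tdm$j, tdm$v):

 library(Matrix)
 ## Build a sparse dgCMatrix directly from the triplet slots,
 ## never allocating the dense matrix
 sparse_tdm <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                            dims = c(tdm$nrow, tdm$ncol),
                            dimnames = tdm$dimnames)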

1 Answer


Before converting to a matrix, remove sparse terms from the Term Document Matrix. This will reduce your matrix size significantly. To remove sparse terms, you can do the following:

 library(tm)
 ## tdm - Term Document Matrix
 ## Keep only terms with sparsity below 0.2, i.e. terms that
 ## occur in more than 80% of the documents
 tdm2 <- removeSparseTerms(tdm, sparse = 0.2)
 tdm_Matrix <- as.matrix(tdm2)

Note: I used 0.2 for sparse just as an example. You should choose that value based on your tdm.
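
To get a feel for a sensible value, here is a minimal sketch of how you might inspect per-term sparsity first (assuming tdm was built with tm, which stores it as a sparse slam simple_triplet_matrix, so the row indices of its nonzero entries sit in tdm$i):

 ## Document frequency: in how many documents each term (row) occurs
 doc_freq <- tabulate(tdm$i, nbins = tdm$nrow)
 ## Sparsity of each term: share of documents it does NOT occur in
 term_sparsity <- 1 - doc_freq / tdm$ncol
 summary(term_sparsity)
 ## Roughly how many terms a given cutoff would keep, e.g. sparse = 0.99
 sum(term_sparsity < 0.99)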

Here are some links that shed light on the removeSparseTerms function and the sparse value:

How does the removeSparseTerms in R work?

https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/removeSparseTerms

Santosh M.
  • Considering removal of sparse terms, you might also think about excluding terms on the basis of tf-idf weighting. For DTMs this is often a reasonable option that avoids losing core information. – Manuel Bickel Nov 21 '17 at 07:55
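
Picking up on that comment, here is a minimal sketch of one way to filter by tf-idf before densifying (assuming tdm comes from tm; the 0.1 cutoff is just a placeholder you would tune for your data):

 library(tm)
 library(slam)
 ## Re-weight the TDM with tf-idf
 tdm_tfidf <- weightTfIdf(tdm)
 ## Mean tf-idf per term (rows of a TDM are terms)
 mean_tfidf <- row_means(tdm_tfidf)
 ## Keep only terms whose mean tf-idf exceeds the chosen cutoff
 tdm_small <- tdm[mean_tfidf > 0.1, ]
 tdm_Matrix <- as.matrix(tdm_small)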