
I'm doing text analysis in R. I created a document-term matrix using the tm library, obtaining a dtm object with the following characteristics:

<<DocumentTermMatrix (documents: 16405, terms: 13002796)>>
Non-/sparse entries: 46650312/213264218068
Sparsity           : 100%
Maximal term length: 2179
Weighting          : term frequency (tf)

The object itself has a size of 1.5 Gb. Now I want to obtain the frequency of the words, and to do this I have to transform the dtm into a dense matrix, using the command:

freq <- colSums(as.matrix(dtm))

but when I call the function, the program responds with the following error:

Error: cannot allocate vector of size 1589.3 Gb

First, why does the program need 1589.3 Gb to store a dtm whose size is only 1.5 Gb? Second, how can I solve the problem? Thanks to everyone.
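For context on the first question, the number in the error message can be reproduced by arithmetic alone: `as.matrix()` builds a *dense* numeric matrix of documents × terms cells, each an 8-byte double, regardless of how sparse the data is. A quick sanity check (Python used here just for the arithmetic):

```python
# Size of the dense matrix that as.matrix(dtm) would have to allocate.
# R stores a dense numeric matrix as 8-byte doubles, zeros included.
docs = 16405
terms = 13002796
cells = docs * terms            # 213,310,868,380 cells in total
gb = cells * 8 / 1024**3        # 8 bytes per double, converted to Gb
print(f"{gb:.1f} Gb")           # prints "1589.3 Gb", matching the error
```

The 1.5 Gb on-disk/in-memory size corresponds to the sparse representation, which stores only the ~46.6 million non-zero entries.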

Pietro Gerace
  • So you want to get the frequency of words? Did you create the TDM for that purpose only? What do you want to do with the TDM? – YOLO Mar 19 '18 at 16:27
  • mmmmm yes...why? – Pietro Gerace Mar 19 '18 at 20:31
  • Because that's not required. You can get the word frequencies directly without going through this memory-intensive process. Check the text2vec or tidytext R packages. – YOLO Mar 19 '18 at 20:33
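The idea behind that last comment — getting word frequencies without ever building the dense matrix — amounts to summing counts over the (document, term, count) triplets that the sparse matrix already stores. A language-agnostic sketch of that accumulation (hypothetical toy data, not the tm API):

```python
from collections import defaultdict

# Hypothetical sparse triplets (doc_id, term, count): only the
# non-zero entries, which is all a sparse DTM actually stores.
triplets = [
    (1, "cat", 2), (1, "dog", 1),
    (2, "cat", 3), (2, "fish", 1),
]

# Summing per term touches only the non-zero entries, so memory stays
# proportional to the number of distinct terms, not docs * terms.
freq = defaultdict(int)
for _doc, term, count in triplets:
    freq[term] += count

print(dict(freq))  # {'cat': 5, 'dog': 1, 'fish': 1}
```

In R, tm stores the DTM as a slam simple triplet matrix, so `slam::col_sums(dtm)` performs this same sparse sum directly, without any dense conversion.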

0 Answers