
I've created a TermDocumentMatrix using the tm package in R. It looks something like this:

> inspect(freq.terms)

A document-term matrix (19 documents, 214 terms)

Non-/sparse entries: 256/3810
Sparsity           : 94%
Maximal term length: 19 
Weighting          : term frequency (tf)

Terms
Docs abundant acid active adhesion aeropyrum alternative
  1         0    0      1        0         0           0
  2         0    0      0        0         0           0
  3         0    0      0        1         0           0
  4         0    0      0        0         0           0
  5         0    0      0        0         0           0
  6         0    1      0        0         0           0
  7         0    0      0        0         0           0
  8         0    0      0        0         0           0
  9         0    0      0        0         0           0
  10        0    0      0        0         1           0
  11        0    0      1        0         0           0
  12        0    0      0        0         0           0
  13        0    0      0        0         0           0
  14        0    0      0        0         0           0
  15        1    0      0        0         0           0
  16        0    0      0        0         0           0
  17        0    0      0        0         0           0
  18        0    0      0        0         0           0
  19        0    0      0        0         0           1

This is just a small sample of the matrix; there are actually 214 terms that I'm working with. On a small scale, this is fine. If I want to convert my TermDocumentMatrix into an ordinary matrix, I'd do:

data.matrix <- as.matrix(freq.terms)

However, the data I've displayed above is just a subset of my overall data, which probably has at least 10,000 terms. When I try to create a TDM from the full data, I get an error:

> Error cannot allocate vector of size n Kb

So from here, I'm looking into alternative, more memory-efficient ways of building and storing my TDM.

I tried turning my TDM into a sparse matrix with the Matrix package, but ran into the same problem.
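
Roughly, the kind of conversion I mean is this sketch (freq.terms stands in for the full TDM here; tm stores a TDM as a slam simple_triplet_matrix, so its i/j/v triplets can be passed straight to Matrix::sparseMatrix):

library(Matrix)

# Sketch: build a sparse matrix from the triplet representation that
# tm uses internally (simple_triplet_matrix slots i, j, v).
sparse.tdm <- sparseMatrix(i        = freq.terms$i,        # term (row) indices
                           j        = freq.terms$j,        # document (column) indices
                           x        = freq.terms$v,        # term frequencies
                           dims     = c(freq.terms$nrow, freq.terms$ncol),
                           dimnames = freq.terms$dimnames)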

What are my alternatives at this point? I feel like I should be investigating one of:

  • the bigmemory/ff packages, as discussed here (although the bigmemory package doesn't seem to be available for Windows at the moment)
  • the irlba package for computing a partial SVD of my TDM, as mentioned here (a rough sketch of this is below)
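
For the second option, this is roughly the sort of thing I've been experimenting with (a sketch; sparse.tdm is the sparse matrix from the snippet above, and nv = 50 is an arbitrary choice of how many singular vectors to keep):

library(irlba)

# Sketch: compute only the leading 50 singular triplets of the sparse
# term-document matrix instead of a full SVD.
svd.partial <- irlba(sparse.tdm, nv = 50)
str(svd.partial)   # list with $d (singular values), $u and $v (singular vectors)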

I've experimented with functions from both libraries but can't seem to arrive at anything substantial. Does anyone know what the best way forward is? I've spent so long fiddling with this that I thought I'd ask people with much more experience working with large datasets before I waste even more time going in the wrong direction.

EDIT: changed 10,00 to 10,000. thanks @nograpes.

user1988898
    I imagine you meant 10,000 terms. How many documents are you looking at? I think it would be easiest to do some preprocessing here: before you create the full matrix, cut out some of the really rare terms. Then you can cut out the terms that have low correlation with whatever you are trying to pull out of the data. – nograpes Feb 10 '14 at 22:40
  • @nograpes yes 10,000 terms I'll edit it now. Having done some further reading (in particular [here](http://stackoverflow.com/questions/6860715/converting-a-document-term-matrix-into-a-matrix-with-lots-of-data)) I think that you're right; the only way to proceed is to drop some of the non-essential terms from my matrix. I guess my concern is that in the future I may be using even larger data sets; what happens when at least 10,000 of my terms are essential (not sparse)? Either way thank you for commenting. – user1988898 Feb 10 '14 at 22:52
  • How many nonzero entries in the co-occurrence set? I've had good results with the Matrix (sparse) package with N around 20Mil IIRC. I'm planning to try using data.table next time around though. – Clayton Stanley Feb 11 '14 at 06:58
  • Perhaps it would be easiest (for us) if you wrote some code that would generate some random documents with roughly the characteristics you are seeing. Then, we can test some solutions and you can show us what you have tried with the sparse matrix solution. Honestly, a 10,000 term sparse matrix shouldn't be that big for documents with "typical" (meaning what I have seen in my limited experience) sparsity. – nograpes Feb 11 '14 at 15:20
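
For reference, the preprocessing suggested in the comments would look something like this (a sketch; full.tdm is a placeholder name for the full TermDocumentMatrix, and the 0.99 sparsity threshold is just an example value):

library(tm)

# Sketch: drop terms that are absent from more than 99% of documents
# before converting to a dense matrix.
tdm.trimmed <- removeSparseTerms(full.tdm, sparse = 0.99)
data.matrix <- as.matrix(tdm.trimmed)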

1 Answer


The qdap package seems to be able to handle a problem this large. The first part recreates a data set that matches the OP's problem, followed by the solution. As of qdap version 1.1.0, there is compatibility with the tm package:

library(qdapDictionaries)

# Generate one random "document": between 100 and 9,100 words sampled
# from the DICTIONARY word list, collapsed into a single string.
FUN <- function() {
   paste(sample(DICTIONARY[, 1], sample(seq(100, 10000, by=1000), 1, TRUE)), collapse=" ")
}

library(qdap)
# Build a tm Corpus of 15 such random documents.
mycorpus <- tm::Corpus(tm::VectorSource(lapply(paste0("doc", 1:15), function(i) FUN())))
This gives a similar corpus...

Now the qdap approach. You first have to convert the Corpus to a data frame (tm_corpus2df) and then use the tdm function to create a TermDocumentMatrix.

out <- with(tm_corpus2df(mycorpus), tdm(text, docs))
tm::inspect(out)

## A term-document matrix (19914 terms, 15 documents)
## 
## Non-/sparse entries: 80235/218475
## Sparsity           : 73%
## Maximal term length: 19 
## Weighting          : term frequency (tf)
Tyler Rinker
  • I face a similar problem as OP. When I use the `tm` package and inspect a corpus ( `match_names <- inspect(DocumentTermMatrix(docs, list(dictionary = names)))` ), I run out of RAM. If I use your code (updated to the new version of QDAP), I generate `out <- with(as.data.frame(mycorpus), as.dtm(match_names, names))` , I find that in the `dtm` a lot of words occur that are not in the original `names` character vector dictionary. Am I doing something wrong? – wake_wake May 01 '16 at 08:02