I've created a TermDocumentMatrix from the tm library in R. It looks something like this:
> inspect(freq.terms)
A document-term matrix (19 documents, 214 terms)
Non-/sparse entries: 256/3810
Sparsity : 94%
Maximal term length: 19
Weighting : term frequency (tf)
     Terms
Docs abundant acid active adhesion aeropyrum alternative
   1        0    0      1        0         0           0
   2        0    0      0        0         0           0
   3        0    0      0        1         0           0
   4        0    0      0        0         0           0
   5        0    0      0        0         0           0
   6        0    1      0        0         0           0
   7        0    0      0        0         0           0
   8        0    0      0        0         0           0
   9        0    0      0        0         0           0
  10        0    0      0        0         1           0
  11        0    0      1        0         0           0
  12        0    0      0        0         0           0
  13        0    0      0        0         0           0
  14        0    0      0        0         0           0
  15        1    0      0        0         0           0
  16        0    0      0        0         0           0
  17        0    0      0        0         0           0
  18        0    0      0        0         0           0
  19        0    0      0        0         0           1
This is just a small sample of the matrix; there are actually 214 terms that I'm working with. On a small scale, this is fine. If I want to convert my TermDocumentMatrix into an ordinary matrix, I'd do:
data.matrix <- as.matrix(freq.terms)
However, the data that I've displayed above is just a subset of my overall data, which probably has at least 10,000 terms. When I try to create a TDM from the overall data, I run into an error:
> Error cannot allocate vector of size n Kb
So from here, I'm looking into more memory-efficient ways of representing and working with my tdm.
I tried turning my tdm into a sparse matrix from the Matrix library but ran into the same problem.
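For reference, here is a minimal sketch of the kind of coercion I mean, assuming the TDM is stored in tm's usual simple_triplet_matrix form (so its $i/$j/$v components hold the non-zero entries):

library(Matrix)
# build a sparse dgCMatrix directly from the triplet representation,
# without ever materialising the dense matrix
sparse.tdm <- sparseMatrix(i = freq.terms$i,
                           j = freq.terms$j,
                           x = freq.terms$v,
                           dims = c(freq.terms$nrow, freq.terms$ncol),
                           dimnames = dimnames(freq.terms))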
What are my alternatives at this point? I feel like I should be investigating one of:
- the bigmemory/ff packages as talked about here (although the bigmemory package doesn't seem available for Windows at the moment)
- the irlba package for computing a partial SVD of my tdm, as mentioned here (a rough sketch of what I mean is below)
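For what it's worth, this is roughly what I have in mind for the irlba route, assuming the sparse matrix above can be built; nv = 20 is just an arbitrary number of singular vectors for illustration:

library(irlba)
# partial SVD: compute only the first 20 singular triplets rather than a full SVD
partial.svd <- irlba(sparse.tdm, nv = 20)
# partial.svd$u, partial.svd$d and partial.svd$v hold the truncated factors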
I've experimented with functions from both libraries but can't seem to arrive at anything substantial. Does anyone know what the best way forward is? I've spent so long fiddling around with this that I thought I'd ask people with much more experience working with large datasets before I waste even more time going in the wrong direction.
EDIT: changed 10,00 to 10,000. thanks @nograpes.