I am using the acast function from Hadley's reshape2 package to transform a flattened dataset (queried from SQL Server) into a term-document matrix, like so:
## Load packages
require("reshape2")
require("plyr")
require("RODBC")
require("lsa")
## Get flattened term-frequency data:
Terms <- read.csv(url("https://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"), header = TRUE)
names(Terms) <- c("id", "Term", "Frequency")
system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency')) # re-cast to a term-document matrix
The issue I'm bumping up against is that terms.mtrx comes out very large, roughly 40,000 rows x 17,000 columns, and the matrix is very sparse.
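To make the memory concern concrete, here is the rough back-of-the-envelope arithmetic I'm going on (the dimensions are approximate, taken from dim(terms.mtrx)):

## Rough footprint of the dense acast result:
## ~40,000 documents x ~17,000 terms, 8 bytes per double-precision cell
40000 * 17000 * 8 / 2^30    # ~5 GiB, before counting any intermediate copies

For reference, here is what the flattened input looks like: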
> head(Terms)
id Term Frequency
1 resume-108008-34530496 enterprise data 2
2 resume-108008-34530496 enterprise data warehouse 2
3 resume-108008-34530496 etl 2
4 resume-108008-34530496 facility 1
5 resume-108008-34530496 faculty 1
6 resume-108008-34530496 financial 1
>
> dim(Terms)
[1] 6139039 3
Is there a faster (less memory-intensive) way to generate this matrix?
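The direction I've been wondering about, though I haven't verified it against this dataset, is skipping the dense cast entirely and building a sparse matrix with the Matrix package. A minimal sketch of what I mean (as far as I can tell, sparseMatrix sums the x values for repeated (i, j) pairs, which should mimic the sum aggregation above):

## Sketch of a sparse alternative (assumes the Matrix package is installed)
library(Matrix)

ids  <- factor(Terms$id)     # document labels -> integer row codes
trms <- factor(Terms$Term)   # term labels -> integer column codes

terms.sparse <- sparseMatrix(
    i = as.integer(ids),                          # row index per observation
    j = as.integer(trms),                         # column index per observation
    x = Terms$Frequency,                          # cell values to place (and sum)
    dimnames = list(levels(ids), levels(trms))    # keep the id/Term labels
)

But I'm not sure whether a sparse matrix will play nicely with the lsa functions I want to use downstream, or whether there is a better-established idiom for this, hence the question.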