Speeding up acast() call to create matrix

Question

I am using the acast function within Hadley's reshape2 package to transform a flattened dataset (queried from SQL Server) into a term-document matrix like so:

## Load packages
require("reshape2")
require("plyr")
require("RODBC")
require("lsa")

## Get flattened term-frequency data:
Terms <- read.csv(url("https://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"), header = TRUE)
names(Terms) <- c("id", "Term", "Frequency")

system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency')) # re-cast to a term-document matrix

The issue I'm running into is that terms.mtrx is very large (about 40,000 rows x 17,000 columns) and very sparse.

> head(Terms)
                      id                      Term Frequency
1 resume-108008-34530496           enterprise data         2
2 resume-108008-34530496 enterprise data warehouse         2
3 resume-108008-34530496                       etl         2
4 resume-108008-34530496                  facility         1
5 resume-108008-34530496                   faculty         1
6 resume-108008-34530496                 financial         1
>
> dim(Terms)
[1] 6139039       3
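
For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes the result is stored as a dense matrix of 8-byte doubles and uses the approximate dimensions given above; the variable names are only for illustration.

```r
# Rough size estimate, assuming a dense numeric (8-byte double) matrix
n_rows    <- 40000    # documents (approximate, from the dimensions above)
n_cols    <- 17000    # terms (approximate, from the dimensions above)
n_triples <- 6139039  # rows in Terms, an upper bound on the non-zero cells

dense_gib <- n_rows * n_cols * 8 / 2^30    # dense storage in GiB
fill      <- n_triples / (n_rows * n_cols) # fraction of cells that can be non-zero

round(dense_gib, 2)   # about 5.07
round(100 * fill, 2)  # about 0.9 (percent)
```

So under these assumptions the dense result needs roughly 5 GiB, while fewer than 1% of its cells can be non-zero.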

Is there a faster (and less memory-intensive) way to generate this matrix?

It would be helpful if you could create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), since we don't have access to your input. Even fake data would be fine. Are you trying to make a proper `TermDocumentMatrix`, as in the class from `tm`, or are you just trying to make your own array? Perhaps using `xtabs(..., sparse=TRUE)` would be a better choice than `acast` to create a proper `sparseMatrix()` object. — MrFlick, Sep 30 '14 at 23:59
@MrFlick I have updated the original with a link to a .csv file in my public Dropbox folder. This should thus be reproducible. — Ray, Oct 01 '14 at 00:04

score 2 · Answer 1 · answered Oct 01 '14 at 00:18

I'm on a system that doesn't support https in base R, so to access the data, I used

library(httr)
Terms <- content(GET("http://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"))
names(Terms) <- c("id", "Term", "Frequency")

Then I compared acast and xtabs(..., sparse=TRUE):

system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency'))
#    user  system elapsed 
#   9.253   0.199   9.662 

system.time(terms.mtrx2 <- xtabs(Frequency~id+Term, Terms, sparse=TRUE))
#    user  system elapsed 
#   0.083   0.009   0.092

and we can see that

all(terms.mtrx == terms.mtrx2)
# [1] TRUE

so the results are the same.
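
For anyone who can't download the CSV, here is a small self-contained sketch of the same idea on made-up data. The toy ids, terms, and the names `toy`, `m_xtabs`, and `m_direct` are hypothetical and not part of the timing run above. It also shows `Matrix::sparseMatrix()` as an alternative way to build the matrix directly from (row, column, value) triplets; I have not timed that variant on the real data.

```r
library(Matrix)

# Made-up long-format data (hypothetical ids and terms)
toy <- data.frame(
  id        = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  Term      = c("etl", "sql", "etl", "faculty", "sql"),
  Frequency = c(2, 1, 3, 1, 4),
  stringsAsFactors = TRUE
)

# Sparse cross-tabulation, same call shape as in the comparison above
m_xtabs <- xtabs(Frequency ~ id + Term, data = toy, sparse = TRUE)

# Direct construction: factor codes serve as row and column indices
m_direct <- sparseMatrix(
  i = as.integer(toy$id),
  j = as.integer(toy$Term),
  x = toy$Frequency,
  dims = c(nlevels(toy$id), nlevels(toy$Term)),
  dimnames = list(levels(toy$id), levels(toy$Term))
)

# Both approaches hold the same values
all(as.matrix(m_xtabs) == as.matrix(m_direct))
```

Both objects are sparse matrices from the Matrix package, so only the non-zero cells are stored. Converting with `as.matrix()` is fine for a toy check like this, but it would defeat the purpose on the full 40,000 x 17,000 data.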

And today I learned what `xtabs` does! Thanks for this! This works great! — Ray, Oct 01 '14 at 00:36
