I am using the acast function within Hadley's reshape2 package to transform a flattened dataset (queried from SQL Server) into a term-document matrix like so:

## Load packages
require("reshape2")
require("plyr")
require("RODBC")
require("lsa")

## Get flattened term-frequency data:
Terms <- read.csv(url("https://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"), header = TRUE)
names(Terms) <- c("id", "Term", "Frequency")

system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency')) # re-cast to a term-document matrix

The problem I'm running into is that terms.mtrx is very large (about 40,000 rows x 17,000 columns) and very sparse.

> head(Terms)
                      id                      Term Frequency
1 resume-108008-34530496           enterprise data         2
2 resume-108008-34530496 enterprise data warehouse         2
3 resume-108008-34530496                       etl         2
4 resume-108008-34530496                  facility         1
5 resume-108008-34530496                   faculty         1
6 resume-108008-34530496                 financial         1
>
> dim(Terms)
[1] 6139039       3
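To put the sizes above in perspective, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: 8-byte numeric cells for the dense matrix, a compressed column layout (one 8-byte value plus one 4-byte row index per non-zero entry, plus one 4-byte pointer per column) for the sparse one, and every row of Terms being a distinct id/Term pair, so the row count is used as an upper bound on the non-zero entries.

```r
# Rough memory estimate; the sizes come from the figures quoted in the question.
n_rows <- 40000      # approximate number of documents (ids)
n_cols <- 17000      # approximate number of distinct terms
nnz    <- 6139039    # rows in Terms, used as an upper bound on non-zero cells

# Dense storage: one 8-byte double per cell (assumption)
dense_bytes <- n_rows * n_cols * 8

# Compressed sparse column storage (assumption): value + row index per
# non-zero entry, plus one column pointer per column (and one extra)
sparse_bytes <- nnz * (8 + 4) + (n_cols + 1) * 4

dense_bytes / 2^30   # size in GiB
sparse_bytes / 2^20  # size in MiB
```

Under these assumptions the dense matrix needs several gigabytes, while the sparse representation needs on the order of tens of megabytes.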

Is there a faster, less memory-intensive way to generate this matrix?

Ray
  • It would be helpful if you could actually create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), since we don't have access to your input. Even fake data would be fine. Are you trying to make a proper `TermDocumentMatrix`, as in the class from `tm`, or are you just trying to make your own array? Perhaps using `xtabs(..., sparse=TRUE)` would be a better choice than `acast` to create a proper `sparseMatrix()` object. – MrFlick Sep 30 '14 at 23:59
  • @MrFlick I have updated the original with a link to a .csv file in my public Dropbox folder. This should thus be reproducible. – Ray Oct 01 '14 at 00:04

1 Answer

I'm on a system that doesn't support https in base R, so to access the data I used

library(httr)
Terms <- content(GET("http://dl.dropboxusercontent.com/u/263772/flat_dtm.csv"))
names(Terms) <- c("id", "Term", "Frequency")

I then compared acast with xtabs(..., sparse=TRUE):

system.time(terms.mtrx <- acast(Terms, id ~ Term, sum, value.var = 'Frequency'))
#    user  system elapsed 
#   9.253   0.199   9.662 

system.time(terms.mtrx2 <- xtabs(Frequency~id+Term, Terms, sparse=TRUE))
#    user  system elapsed 
#   0.083   0.009   0.092 

and we can see that

all(terms.mtrx == terms.mtrx2)
# [1] TRUE

so the results are the same, and in this run xtabs took about 0.09 seconds elapsed versus about 9.7 seconds for acast.
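For anyone who wants to try the idea without downloading the file, here is a minimal sketch on made-up data. The toy Terms table and the names m1, m2, ids and terms are invented for illustration. It also shows one possible alternative, building the matrix directly with Matrix::sparseMatrix() from factor codes, which sums duplicated (row, column) pairs and so mirrors the sum aggregation passed to acast.

```r
library(Matrix)

# Toy stand-in for the flattened term-frequency table (made-up values)
Terms <- data.frame(
  id        = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  Term      = c("etl", "faculty", "etl", "financial", "etl"),
  Frequency = c(2, 1, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Option 1: xtabs with sparse = TRUE, as used above
m1 <- xtabs(Frequency ~ id + Term, data = Terms, sparse = TRUE)

# Option 2: build the sparse matrix directly from factor codes.
# Duplicated (i, j) pairs, if any, are summed by sparseMatrix().
ids   <- factor(Terms$id)
terms <- factor(Terms$Term)
m2 <- sparseMatrix(
  i        = as.integer(ids),
  j        = as.integer(terms),
  x        = Terms$Frequency,
  dims     = c(nlevels(ids), nlevels(terms)),
  dimnames = list(levels(ids), levels(terms))
)

# Both approaches should give the same cell values
all(as.matrix(m1) == as.matrix(m2))
```

Either object stays sparse in memory; calling as.matrix() on the full 40,000 x 17,000 result would bring back the dense size, so it is only used here because the toy data is tiny.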

MrFlick