
I have a 6 GB data set of 6 million messages that I want to process. My goal is to create a Document Term Matrix for the dataset, but I need to do some pre-processing first (stripping out HTML tags, stemming, stop-word removal, etc.).

Here is how I am currently attempting to do all this:

library(data.table)
library(tm)

wordStem2 <- function(strings){
  # split each string into words, stem each word, then paste the words back together
  sapply(lapply(strsplit(stripWhitespace(strings), " "), wordStem),
         function(x) paste(x, collapse = " "))
}
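# (for illustration) wordStem2 stems each word of a string, e.g.
# wordStem2("running cats quickly") should give something like "run cat quickli"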

load("data/train.RData")
sampletrainDT <- as.data.table(train)
rm(train)
setkey(sampletrainDT, Id)

object.size(sampletrainDT) # 5,632,195,744 bytes

gc()
sampletrainDT[, Body := tolower(Body)]
object.size(sampletrainDT) # 5,631,997,712 bytes, but rsession usage is at 12 GB. gc doesn't help.
gc()
sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)] # remove HTML tags
sampletrainDT[, Body := gsub("\n", " ", Body)] # remove \n
sampletrainDT[, Body := removePunctuation(Body)]
sampletrainDT[, Body := removeNumbers(Body)]
sampletrainDT[, Body := removeWords(Body, stopwords("english"))]
sampletrainDT[, Body := stripWhitespace(Body)]
sampletrainDT[, Body := wordStem2(Body)]

Output of ls() at each step:

ls()
[1] "sampletrainDT" "wordStem2"  

Each row of sampletrainDT is one message, and the main column is Body; the other columns contain metadata such as docid.
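
For completeness, once the cleaning is done the plan is roughly the standard tm route below (I haven't got this far on the full data set, so this is only a sketch, and corp / dtm are just placeholder names):

corp <- Corpus(VectorSource(sampletrainDT$Body)) # one document per message
dtm  <- DocumentTermMatrix(corp)                 # the Document Term Matrix I'm after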

This runs pretty quickly (about 10 minutes) when I am working with only a 10% subset of the data, but it doesn't even complete on the full data set: I run out of RAM on the HTML-stripping line, sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)]. Running gc() between the lines doesn't seem to improve the situation.
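
One other thing I have wondered about (untested at full scale, so this is just a guess) is whether the lazy (.|\n)*? pattern is making that gsub heavier than it needs to be. A simpler pattern that matches from < to the nearest > would be:

# same matches as the lazy pattern; perl = TRUE is usually faster,
# though I don't know whether it actually helps with memory
sampletrainDT[, Body := gsub("<[^>]*>", " ", Body, perl = TRUE)]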

I've spent a couple of days Googling but haven't found a good solution yet, so I'm interested to hear from others who have experience with this. Here are some options I am considering:

  1. ff or bigmemory - hard to use and not well suited to text
  2. Databases
  3. Read in chunks at a time, process and append to a file (see the rough sketch at the end of this question; maybe better suited to Python?)
  4. PCorpus from the tm library
  5. Map-reduce - done locally, but hopefully in a memory-friendly way
  6. Is R just not the right tool for this?

I would like to keep this running on a single machine (16 GB laptop) instead of using a big machine on EC2. 6GB of data doesn't seem insurmountable if done properly!
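
For option 3, this is the kind of chunked approach I have in mind (rough, untested sketch: the chunk size and output file name are placeholders, and clean_body() is just the pipeline above wrapped into a function):

clean_body <- function(x) {
  x <- tolower(x)
  x <- gsub("<(.|\n)*?>", " ", x)        # remove HTML tags
  x <- gsub("\n", " ", x)                # remove \n
  x <- removePunctuation(x)
  x <- removeNumbers(x)
  x <- removeWords(x, stopwords("english"))
  x <- stripWhitespace(x)
  wordStem2(x)
}

chunk_size <- 100000                     # placeholder; tune to whatever fits in RAM
n <- nrow(sampletrainDT)

for (s in seq(1, n, by = chunk_size)) {
  idx   <- s:min(s + chunk_size - 1, n)
  chunk <- data.frame(Id   = sampletrainDT$Id[idx],
                      Body = clean_body(sampletrainDT$Body[idx]),
                      stringsAsFactors = FALSE)
  write.table(chunk, "train_clean.csv", sep = ",",
              append = (s > 1), col.names = (s == 1), row.names = FALSE)
  rm(chunk); gc()                        # drop the chunk before moving on
}

Would that actually keep peak memory down, or does having the full sampletrainDT loaded already defeat the purpose?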

mchangun
  • I find it very strange that you need all that RAM... I process files of 5-10 million rows every day on my 16 GB laptop. Can you post a snapshot of your file? – statquant Oct 17 '13 at 15:37
  • @statquant First 100 rows of my data frame. https://dl.dropboxusercontent.com/u/25747565/temp.RData – mchangun Oct 17 '13 at 15:43
  • Or you can use another language/script to clean your data, then use R for analysis. – Fernando Oct 17 '13 at 15:54
  • Which line causes the RAM to run out? Have you tried running `gc()` in between each of the lines? What is `object.size(data)`? – mrip Oct 17 '13 at 16:14
  • @mrip It runs out of RAM on this line `sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)] # remove HTML tags`. After the first line (`sampletrainDT[, Body := tolower(Body)]`), my rsession mem usage went from 6 GB to 12 GB. Strangely, running gc() didn't bring it back down. – mchangun Oct 17 '13 at 16:40
  • Try just using a `data.frame`, see if that works. It doesn't look like you are really using any `data.table` functionality, so a `data.frame` should be more predictable in terms of memory usage. – mrip Oct 17 '13 at 16:50
  • @mrip - Thanks for helping out. I tried the `data.frame` approach but same thing - gc() doesn't reduce the mem usage after the first line. – mchangun Oct 17 '13 at 17:22
  • Can you edit the OP to show exactly what commands you are running (from a fresh R session), and also add `gc()` and `ls()` and `object.size(trainDT)` between each line. I know that won't be reproducible, because I don't have the data, but it will be helpful in diagnosing. – mrip Oct 17 '13 at 17:24
  • @mrip Done as requested. – mchangun Oct 17 '13 at 17:40
  • What is the output of `gc()`. Are you getting rsession usage from the system monitor or from `gc()`? – mrip Oct 17 '13 at 17:58
  • Use package tm.plugin.dc –  Oct 17 '13 at 18:11
  • The moment you start talking about "cleaning HTML", my brain goes straight to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Carl Witthoft Oct 17 '13 at 19:21

1 Answer


I'm not sure exactly what's going on, but here are some hopefully useful tips. First, here is a function I use to monitor which objects are taking up memory:

lsBySize <- function(k = 20, envir = globalenv()){
  # size (in bytes) of every object in envir, largest first
  z <- sapply(ls(envir = envir), function(x) object.size(get(x, envir = envir)))
  ret <- sort(z, decreasing = TRUE)
  if (k > 0)
    ret <- ret[1:min(k, length(ret))]

  as.matrix(ret)/10^6   # report in MB
}

Running gc() at any time will tell you how much memory is currently being used. If sum(lsBySize(length(ls()))) is not approximately equal to the amount of memory used as reported by gc(), then something strange is going on. In that case, please edit the OP to show the output of running these two commands consecutively in your R session. Also, to isolate the issue, it is better to work with data.frames, at least for now, because the internals of data.tables are more complicated and opaque.
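
To be concrete, the check I have in mind is just something like this (the exact numbers will of course differ on your machine):

lsBySize()                    # biggest objects, in MB
sum(lsBySize(length(ls())))   # total MB accounted for by named objects
gc()                          # compare with the "used" / (Mb) columns reported here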

mrip