I have a 6 GB data set of 6 million messages that I want to process. My goal is to create a Document Term Matrix for the data set, but I need to do some pre-processing first (stripping out HTML tags, stemming, stop-word removal, etc.).
Here is how I am currently attempting to do all this:
library(data.table)
library(tm)
# Split each message into words, stem each word, and paste the stems back together.
wordStem2 <- function(strings){
  stems <- lapply(strsplit(stripWhitespace(strings), " "), wordStem)
  sapply(stems, paste, collapse = " ")
}
load("data/train.RData")
sampletrainDT <- as.data.table(train)
rm(train)
setkey(sampletrainDT, Id)
object.size(sampletrainDT) # 5,632,195,744 bytes
gc()
sampletrainDT[, Body := tolower(Body)]
object.size(sampletrainDT) # 5,631,997,712 bytes, but the rsession process is using 12 GB; gc() doesn't help
gc()
sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)] # remove HTML tags
sampletrainDT[, Body := gsub("\n", " ", Body)] # remove \n
sampletrainDT[, Body := removePunctuation(Body)]
sampletrainDT[, Body := removeNumbers(Body)]
sampletrainDT[, Body := removeWords(Body, stopwords("english"))]
sampletrainDT[, Body := stripWhitespace(Body)]
sampletrainDT[, Body := wordStem2(Body)]
ls() at each line:
ls()
[1] "sampletrainDT" "wordStem2"
Each row of sampletrainDT is one message and the main column is Body. The other columns contain metadata like docid, etc.
This runs pretty quickly (about 10 minutes) when I work with only a 10% subset of the data, but it doesn't even complete on the full data set: I run out of RAM on the HTML-stripping line, sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)]. Running gc() between the lines doesn't seem to improve the situation.
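For context, the end goal once the cleaning works is just the standard tm route, roughly like this (I never get this far on the full data set):
corpus <- Corpus(VectorSource(sampletrainDT$Body)) # one document per message
dtm <- DocumentTermMatrix(corpus)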
I've spent a couple of days Googling, but I haven't found a good solution yet, so I'm interested to hear from others who have experience with this. Here are the options I am considering:
- ff or bigmemory - hard to use and not suited for text
- Databases
- Read the data in chunks, process each chunk, and append the results to a file (see the chunked-processing sketch after this list) - or is this better suited to Python?
- PCorpus from the tm library (sketch after this list)
- Map-reduce - done locally, but hopefully in a memory-friendly way
- Is R just not the tool for this?
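For the chunked option, what I have in mind is roughly the following (a rough, untested sketch - clean_chunk, the chunk size and the output file name are just placeholders):
# Apply the same cleaning steps as above to one chunk of messages.
clean_chunk <- function(x) {
  x <- tolower(x)
  x <- gsub("<(.|\n)*?>", " ", x) # remove HTML tags
  x <- gsub("\n", " ", x)
  x <- removePunctuation(x)
  x <- removeNumbers(x)
  x <- removeWords(x, stopwords("english"))
  x <- stripWhitespace(x)
  wordStem2(x)
}

chunk.size <- 100000
n <- nrow(sampletrainDT)
for (start in seq(1, n, by = chunk.size)) {
  rows <- start:min(start + chunk.size - 1, n)
  out <- data.table(Id = sampletrainDT$Id[rows],
                    Body = clean_chunk(sampletrainDT$Body[rows]))
  # Append each cleaned chunk to disk instead of keeping everything in memory.
  write.table(out, "body_clean.tsv", sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = (start == 1),
              append = (start > 1))
  rm(out); gc()
}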
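And for the PCorpus option, my understanding is that it would look something like this (again just a sketch: it needs the filehash package installed, "pcorpus.db" is a placeholder, and the HTML stripping would still have to happen beforehand):
# PCorpus keeps the documents on disk (via filehash) instead of in RAM.
pc <- PCorpus(VectorSource(sampletrainDT$Body),
              dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))
dtm <- DocumentTermMatrix(pc, control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             stopwords = TRUE,
                                             stemming = TRUE))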
I would like to keep this running on a single machine (16 GB laptop) instead of using a big machine on EC2. 6GB of data doesn't seem insurmountable if done properly!