I have a large corpus I'm doing transformations on with tm::tm_map(). Since I'm using hosted RStudio, I have 15 cores and wanted to make use of parallel processing to speed things up.
I haven't been able to reproduce the problem with dummy data; it only shows up with my very large corpus, which I can't share, so I'll describe it as best I can.
My code is below. A short description of the problem: looping over the pieces manually in the console works, but doing the same loop inside my function does not.
The function clean_corpus() takes a corpus as input, breaks it into pieces, and saves each piece to a temp file to work around RAM issues. The function then iterates over the pieces with a %dopar% block. It worked when testing on a small subset of the corpus, e.g. 10k documents, but on a larger corpus it returned NULL. To debug, I changed the function to return the individual pieces that had been looped over rather than the rebuilt corpus. I found that on smaller corpus samples the code returned a list of all the mini corpora as expected, but as I tested on larger samples of the corpus, some elements of the returned list were NULL.
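For example, this is roughly how I check the debug output for NULL pieces (the object name res is just for illustration):
# res stands for the list returned by the debug version of clean_corpus()
# (the version that returns the looped-over pieces instead of the rebuilt corpus)
null.chunks <- which(vapply(res, is.null, logical(1)))
null.chunks          # non-empty on the larger samples
length(null.chunks)  # how many of the mini corpora came back NULL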
Here's why this is baffling to me:
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works
cleaned.corpus <- clean_corpus(corpus.regular[10001:20000], n = 1000) # also works
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # NULL
If I do this in 10k blocks, e.g. up to 50k via 5 separate calls, everything works. If I run the function once on the full 50k documents, it returns NULL.
So maybe I just need to break my corpus up more and loop over smaller pieces. I tried this: in the clean_corpus function below, the parameter n is the length of each piece. The function still returns NULL.
So, if I iterate like this:
# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000)
If I do that 5 times manually up to 50k, everything works. The equivalent of doing that in one call to my function is:
# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000)
Returns NULL.
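For concreteness, the manual approach that does work is roughly equivalent to this sketch (the combine step mirrors the one inside my function):
# five separate 10k calls, then combine the results
starts <- seq(1, 50000, by = 10000)   # 1, 10001, 20001, 30001, 40001
cleaned.pieces <- lapply(starts, function(s) clean_corpus(corpus.regular[s:(s + 9999)], n = 1000))
cleaned.corpus <- do.call(function(...) c(..., recursive = TRUE), cleaned.pieces)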
This SO post and the one linked to in its only answer suggested it might have to do with my hosted instance of RStudio on Linux, where the Linux out-of-memory (OOM) killer might be stopping workers. This is why I tried breaking my corpus into pieces in the first place: to get around memory issues.
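If the OOM killer were the culprit, something like the following should show evidence of it (a sketch, assuming dmesg is readable on the hosted machine):
# check the kernel log for OOM kills from within R
system("dmesg | grep -i -E 'out of memory|killed process'")
# and watch free memory while clean_corpus() is running
system("free -m")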
Any theories or suggestions as to why iterating over 10k documents in 10 chunks of 1k works, whereas iterating over 50k documents in 50 chunks of 1k does not?
Here's the clean_corpus function:
library(tm)
library(qdap)
library(foreach)
library(doParallel)

clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece for parallel processing
  # split the corpus into pieces for looping, to get around memory issues with the transformations
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each = n, length.out = nr))
  lenp <- length(pieces)
  rm(corpus) # save memory
  # save the pieces to rds files since there is not enough RAM to hold everything
  tmpfile <- tempfile()
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]], paste0(tmpfile, i, ".rds"))
  }
  rm(pieces) # save memory
  # doParallel
  registerDoParallel(cores = 14) # I've experimented with 2:14 cores
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    piece <- tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = F, strip = T, char.keep = c("-", ":", "/"))))
    piece # value collected into the list returned by foreach
  }
  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  return(corpus)
} # end clean_corpus function
For readability, here are the calls from above again, now that the function has been shown:
# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works
# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # does not work
But calling the function in the console on each of
corpus.regular[1:10000], corpus.regular[10001:20000], corpus.regular[20001:30000], corpus.regular[30001:40000], corpus.regular[40001:50000] # works on each run
Note that I also tried using the tm package's built-in functionality for parallel processing (see here), but I kept hitting "cannot allocate memory" errors, which is why I tried to do it "on my own" with doParallel and %dopar%.
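For reference, the tm-level parallel approach looks roughly like this (a minimal sketch assuming tm >= 0.7, which exposes tm_parLapply_engine(), and a cluster from the parallel package):
library(tm)
library(parallel)
cl <- makeCluster(14)
clusterEvalQ(cl, library(tm))  # make tm available on the workers
tm_parLapply_engine(cl)        # tell tm to run its internal lapply over the cluster
corpus.tmp <- tm_map(corpus.regular, content_transformer(removeNumbers))
stopCluster(cl)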