
I have a large corpus that I'm running transformations on with tm::tm_map(). Since I'm using hosted RStudio, I have 15 cores and wanted to make use of parallel processing to speed things up.

I can't share the very large corpus itself, and I'm simply unable to reproduce the problem with dummy data.

My code is below. The short description of the problem is that looping over the pieces manually in the console works, but doing so within my function does not.

Function "clean_corpus" takes a corpus as input, breaks it up into pieces and saves to a tempfile to help with ram issues. Then the function iterates over each piece using a %dopar% block. The function worked when testing on a small subset of the corpus e.g. 10k documents. But on larger corpus the function was returning NULL. To debug I set the function to return the individual pieces that had been looped over and not the re built corpus as a whole. I found that on smaller corpus samples the code would return a list of all mini corpus' as expected, but as I tested on larger samples of the corpus the function would return some NULLs.

Here's why this is baffling to me:

cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works
cleaned.corpus <- clean_corpus(corpus.regular[10001:20000], n = 1000) # also works
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # NULL

If I do this in 10k blocks up to e.g. 50k via 5 separate calls, everything works. If I run the function on the full 50k documents in one call, it returns NULL.

So, maybe I just need to loop over smaller pieces by breaking my corpus up more. I tried this: in the clean_corpus function below, the parameter n is the length of each piece. The function still returns NULL.

So, if I iterate like this:

# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000)

If I do that 5 times manually up to 50k, everything works. The equivalent of doing that in one call to my function is:

# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000)

Returns NULL.

This SO post, and the one linked to in its only answer, suggested it might have to do with my hosted RStudio instance on Linux, where the Linux out-of-memory (OOM) killer might be stopping workers. This is why I tried breaking my corpus into pieces in the first place, to get around memory issues.

Any theories or suggestions as to why iterating over 10k documents in 10 chunks of 1k works, whereas 50 chunks of 1k does not?

Here's the clean_corpus function:

clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece for parallel processing

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # save memory

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # save memory

  # doparallel
  registerDoParallel(cores = 14) # I've experimented with 2:14 cores
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    piece <- tm_map(piece, content_transformer(function(x, ...) 
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = F, strip = T, char.keep = c("-", ":", "/"))))
  }

  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  return(corpus)

} # end clean_corpus function

Here are the calling code blocks from above again, for readability now that the function has been shown:

# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works

# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # does not work

But iterating in the console by calling the function on each of

corpus.regular[1:10000], corpus.regular[10001:20000], corpus.regular[20001:30000], corpus.regular[30001:40000], corpus.regular[40001:50000] # does work on each run

Note that I tried using the tm package's own parallel-processing functionality (see here), but I kept hitting "cannot allocate memory" errors, which is why I tried to do it "on my own" using doParallel's %dopar%.

Doug Fir
  • Hi, thanks for commenting. I understand it's a memory issue… but that's exactly why I went the loop route. Doesn't a loop help alleviate this by calculating in chunks rather than as a whole? – Doug Fir Aug 23 '17 at 10:12
  • Also, I did watch the script running with 1+ cores via shell > top > 1. In each case there appears to be lots of memory free. – Doug Fir Aug 23 '17 at 10:19
  • Ah, I never considered that. Thing is, I'm able to load the entire structure into R. The 50k sample is tiny compared to the full 10M doc corpus, so even the chunks should not cause a memory problem. I wonder if I should try saving all the pieces to tempfiles too, like I did near the top of the function – Doug Fir Aug 23 '17 at 10:30
  • Hi, can you expand on this part: ```.packages="tm"```? Yes, I can save RDS OK too. – Doug Fir Aug 23 '17 at 10:31
  • Oh I see. I'm very new to parallel processing in R, but I thought that when using the doParallel package any objects are automatically exported to the workers, as opposed to using e.g. parallel::parLapply. But I'm really not sure. Anyway, may I ask: could a solution be to save each piece to RDS at the end of the %dopar% block and then read them all in after calling the function? – Doug Fir Aug 23 '17 at 10:36
  • If I added this line before the closing ```}``` of my %dopar% block, ```saveRDS(piece, paste0(tmpfile, i, "completed", ".rds"))```, won't memory be unaffected, since %dopar% stores each piece in a list by default? I would need to somehow tell it to "forget" the piece after saving to RDS – Doug Fir Aug 23 '17 at 10:40
  • Hey, it's working! Wow, this may actually work on my 10M doc corpus! Feeling good over here. Thank you so much for your tip about return(1) – Doug Fir Aug 23 '17 at 11:11
  • @ChiPak I think he doesn't need to export packages because he uses `registerDoParallel(cores = 14)` which (except on Windows) forks the R session (with packages if already loaded). – F. Privé Aug 23 '17 at 11:15
  • If you want to make this an answer I'll accept it, wouldn't feel right if I answer it myself on the back of your knowledge – Doug Fir Aug 23 '17 at 11:16
  • Thanks @F.Privé for the clarification on that – Doug Fir Aug 23 '17 at 11:16
  • @DougFir Can you try with `cl <- parallel::makeCluster(14); doParallel::registerDoParallel(cl); on.exit(parallel::stopCluster(cl), add = TRUE)` instead of `registerDoParallel(cores = 14)`? – F. Privé Aug 23 '17 at 11:17
  • @F.Privé OK, I can try that. Is your intention that I leave the ```return(1)```? More precisely, what would your proposed %dopar% block look like? – Doug Fir Aug 23 '17 at 11:21
  • @F.Privé I tried registering the cluster with your lines of code but hit an error suggesting the libraries were not imported, "tm_map" not recognised or something to that effect – Doug Fir Aug 23 '17 at 11:29
  • Just add the `.packages="tm"` option as mentioned by @ChiPak – F. Privé Aug 23 '17 at 11:41
  • @F.Privé OK, I'll give this a try, but what does it do? What's the purpose of adding ```on.exit(parallel::stopCluster(cl), add = TRUE)```? – Doug Fir Aug 23 '17 at 11:44
  • You need to stop clusters after using them. This will automatically do it at the end of your function. – F. Privé Aug 23 '17 at 12:56
  • Oh. I was reading the documentation for doParallel. If I just use ```registerDoParallel```, which seems to automatically take care of the clusters, then the package automatically closes them too. See section 8 on page 6. Is there a benefit to manually creating the cluster with makeCluster() if it's done automatically for you? https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf – Doug Fir Aug 23 '17 at 13:00
  • Yes, by default it uses forking on Linux and Mac. I wanted you to try with clusters. – F. Privé Aug 23 '17 at 14:41
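For reference, here is a minimal sketch of the cluster-based setup suggested in the comments above, with the ```.packages``` option added. It assumes it sits inside clean_corpus(), where lenp and tmpfile are already defined, and the loop body is abbreviated for illustration rather than being the exact code from the question:

cl <- parallel::makeCluster(14)                  # explicit PSOCK cluster instead of forking
doParallel::registerDoParallel(cl)
on.exit(parallel::stopCluster(cl), add = TRUE)   # stop the workers when the function exits

# With a PSOCK cluster each worker is a fresh R session, so packages used
# inside the loop must be listed in .packages (forked workers inherit them)
pieces <- foreach(i = seq_len(lenp), .packages = c("tm", "qdap")) %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  piece <- tm_map(piece, content_transformer(removeNumbers))  # abbreviated transformations
  piece
}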

1 Answer


Summary of solution from comments

Your memory issue is likely related to corpus <- do.call(function(...) c(..., recursive = TRUE), pieces), because this still stores all of your (output) data in memory.

I recommended exporting the output from each worker to a file, such as an RDS or CSV file, rather than collecting it into a single data structure at the end.

An additional problem (as you pointed out) is that foreach saves the output of each worker via an implied return statement (the code block in {} after %dopar% is treated as a function). I recommended adding an explicit return(1) before the closing } so that the intended output, which you already explicitly saved to a file, is not also kept in memory.
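A minimal sketch of what that change might look like inside clean_corpus(); the "_cleaned" file suffix and the read-back step are illustrative, not the exact code used in the thread:

registerDoParallel(cores = 14)  # forked workers inherit loaded packages, as in the question

foreach(i = seq_len(lenp)) %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  piece <- tm_map(piece, content_transformer(removeNumbers))   # abbreviated transformations
  saveRDS(piece, paste0(tmpfile, i, "_cleaned.rds"))           # write the cleaned piece to disk
  return(1)  # return a tiny placeholder so foreach doesn't hold the piece in memory
}

# Read the cleaned pieces back one at a time as needed, rather than
# rebuilding one giant corpus in memory with do.call(c, ...):
piece1 <- readRDS(paste0(tmpfile, 1, "_cleaned.rds"))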

CPak