
Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?"

I have a corpus that I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package.

Sometimes the transformations take effect, but sometimes they don't. For example, tm::removeNumbers(): the very first document in the corpus has a content value of "n417", so if preprocessing is successful this document will be transformed to just "n".
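
For reference, here is what a successful transformation looks like on a single document (a minimal sketch, assuming only the tm package):

  library(tm)
  check <- VCorpus(VectorSource("n417"))
  check <- tm_map(check, removeNumbers)
  check[[1]]$content # Expected: "n"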

Sample corpus is shown below for reproduction. Here is the code block:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
library(SnowballC)

  corpus <- (see below)
  n <- 100 # This is the size of each chunk in the loop

  # Split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # Save memory

  # Save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # Save memory

  # Doparallel
  registerDoParallel(cores = 12)
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # Regular transformations
    piece <- tm_map(piece, content_transformer(removePunctuation), preserve_intra_word_dashes = T)
    piece <- tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("english"), separate = F)))
    piece <- tm_map(piece, removeNumbers)
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1) # Hack to get dopar to forget the piece to save memory since now saved to rds
  }

  stopImplicitCluster()

  # Combine the pieces back into one corpus
  corpus <- list()
  corpus <- foreach(i = seq_len(lenp)) %do% {
    corpus[[i]] <- readRDS(paste0(tmpfile, i, ".rds"))
  }
  corpus_done <- do.call(function(...) c(..., recursive = TRUE), corpus)

And here is the link to the sample data. A sufficiently large sample of 2k documents is needed to recreate the problem, and I can't paste that much here, so please see the linked document for the data.

corpus <- VCorpus(VectorSource([paste the chr vector from link above]))

If I run the code block above with n set to 200 and then look at the results, I can see that numbers remain where they should have been removed by tm::removeNumbers():

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n417"
[1] "disturbance"
[1] "grand theft auto"

However, if I change the chunk size (the value of the "n" variable) to 100:

> lapply(1:10, function(i) print(corpus_done[[i]]$content)) %>% unlist
[1] "n"
[1] "disturbance"
[1] "grand theft auto"

The numbers have been removed.
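
Rather than eyeballing the first few documents, a quicker whole-corpus check (a sketch, assuming corpus_done from the code block above) is to count how many documents still contain a digit:

  has_digit <- sapply(seq_along(corpus_done), function(i)
    any(grepl("[0-9]", corpus_done[[i]]$content)))
  sum(has_digit) # Expected to be 0 if removeNumbers worked everywhere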

But this is inconsistent. I tried to narrow it down by testing chunk sizes of 150, then 125, and so on, and found that the boundary between working and not working was somewhere between a chunk size of 120 and 125. Then, re-running the function for chunk sizes between 120 and 125, it would sometimes work and sometimes not for the same chunk size.
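
To give an idea of the sweep, it was roughly of this form (a sketch only; run_pipeline is a hypothetical wrapper around the code block above that takes the chunk size n and returns the re-assembled corpus_done):

  # run_pipeline() is hypothetical: it wraps the split/transform/recombine
  # code block above, parameterised by the chunk size n
  for (n in c(100, 120, 121, 122, 123, 124, 125, 150, 200)) {
    corpus_done <- run_pipeline(corpus, n)
    cat("chunk size", n, "-> first document:", corpus_done[[1]]$content, "\n")
  }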

I think there may be a relationship between three variables in this issue: the size of the corpus, the chunk size, and the number of cores in registerDoParallel(). I just don't know what it is.

What is the solution? Can this problem be reproduced with the linked sample corpus? I'm concerned because I can reproduce the error sometimes and other times I cannot. Changing the chunk size sometimes makes the removeNumbers error visible, but not always.


Update

Today I resumed my session and could not replicate the error. I created a Google Docs document and experimented with differing values for corpus size, number of cores, and chunk size. In each case everything succeeded. So I tried running on the full data and everything worked. However, for my sanity, I ran it again on the full data and it failed. Now I'm back to where I was yesterday.

It appears as though having run the function on a larger dataset has changed something ... I don't know what! Perhaps a session variable of some sort?

So, the new information is that this bug only happens after running the function on a very large dataset. Restarting my session did not solve the problem, but resuming the session after being away for several hours did.


New information:

It might be easier to reproduce the issue on a larger corpus, since size seems to be what triggers it. The following will create a roughly 500k-document corpus from the sample I provided:

  corpus <- do.call(c, replicate(250, corpus, simplify = F))

The function may work the first time you call it, but for me it seems to fail the second time.
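
As a quick sanity check after building the enlarged corpus (a sketch):

  length(corpus)      # Should be roughly 500,000 documents (250 x the 2k sample)
  corpus[[1]]$content # Still "n417" before any preprocessing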

This issue is hard to debug because I cannot reproduce it reliably; if I could, I would likely be able to identify and fix it.


New information:

As there are several things happening in this function, it was hard to know where to focus debugging efforts. I was looking at both the fact that I'm using multiple temporary RDS files to save memory and the fact that I'm doing parallel processing. So I wrote two alternative versions of the script: one that still uses the RDS files and breaks the corpus up but does not do parallel processing (I replaced %dopar% with just %do% and removed the registerDoParallel line; see the sketch below), and one that uses parallel processing but does not use temporary RDS files to split the small sample corpus up.
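
For reference, the single-core variant is just the same loop with %dopar% swapped for %do% (a sketch of the altered loop only; everything else stays as in the code block above):

  # Sequential version: %do% instead of %dopar%, and no registerDoParallel() call
  pieces <- foreach(i = seq_len(lenp)) %do% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # ... same tm_map transformations as in the parallel version ...
    piece <- tm_map(piece, removeNumbers)
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1)
  }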

I was not able to reproduce the error with the single-core version of the script; only with the version that uses %dopar% was I able to recreate the issue (though the issue is intermittent; it does not always fail with %dopar%).

So, this issue only appears when using %dopar%. The fact that I'm using temporary RDS files does not appear to be part of the problem.

  • I don't understand what you call a corpus. You give us only a vector of characters. – F. Privé Aug 25 '17 at 07:47
  • See this block in my post: ```docs <- (copy from link above) corpus <- VCorpus(VectorSource(docs))``` this takes the vector and turns to a corpus. So just wrap everything on the linked doc inside of ```VCorpus(VectorSource([character vector goes here]))``` – Doug Fir Aug 25 '17 at 07:48
  • @DougFir despite all my respect for `tm`, I would just recommend to 1) create your own preprocessing function (learn some regex) 2) jump to another package - `text2vec` or `quanteda` - it will be much faster and easier – Dmitriy Selivanov Aug 29 '17 at 05:13
  • @DmitriySelivanov just looked at the documentation for quanteda; it looks really interesting actually and I might give it a try. This looks like a doParallel issue when used with tm. If I were to use quanteda and it does process faster, it might mean I don't have to use parallel processing – Doug Fir Aug 29 '17 at 06:47
  • @DougFir give a chance to `text2vec` as well (I'm the author :-) ). Check tutorials on http://text2vec.org – Dmitriy Selivanov Aug 29 '17 at 07:04
  • @DmitriySelivanov Ok thanks for the tip! I'll take a look there too – Doug Fir Aug 29 '17 at 07:06
  • I'd give this a try to help but I am not really sure what end result you want from the input character vector in your link above. Is it simply to remove the numbers from the character data, but in a way that is parallelized? – Ken Benoit Aug 31 '17 at 00:09
  • @KenBenoit the input to the code block above is meant to be a tm corpus. The link in my Gdoc is just a character vector ```corpus <- VCorpus(VectorSource([paste the chr vector from link above])) ```. Remove numbers and the other transformations. The issue in a nutshell is that everything seems to work fine when using a single core. The transformations using tm_map on corpus work. However, when using multiple cores the tm_map transformations SOMETIMES work. It's hard to recreate since it appears random. I have noticed that the issue seems to show up after running the code block on a very l... – Doug Fir Aug 31 '17 at 06:42
  • (cont) ... on a very large corpus. So if you join the example corpus provided onto itself to make it e.g. 500k or even 1M large, you might find it works the first time. However if you try running a second time the code block might not work and the transformations will appear as none took place. This issue is particularly tricky since reproduction of it is inconsistent. It only sometimes does not work. However, this seems to only be an issue when using multiple cores, otherwise everything works fine (just slow) – Doug Fir Aug 31 '17 at 06:44
  • I asked because I think there are much easier, faster, and more scalable methods to achieve what you want than the approach you are taking. Please state simply what you seek to do: Remove numerals from the text, got it. Anything else? – Ken Benoit Aug 31 '17 at 09:26
  • Hi @Ken. OK, I would like to remove numbers, punctuation and stopwords from my corpus. (Actually, since posting this I have started using quanteda, which appears to use parallel processing (I watched the terminal when running), and everything worked beautifully with no issues. So in actual fact my immediate problem is solved thanks to quanteda.) However, I would have loved to understand what was happening above, but appreciate it's likely very tricky to debug since the issue appeared somewhat sporadically and inconsistently. Out of curiosity, which solution would you have suggested? – Doug Fir Aug 31 '17 at 10:52
  • Agree with the above comments that `tm_map` is very difficult to debug, has many poorly documented idiosyncrasies, and that using an `apply` approach with your own custom function vs another package is probably much better in both the short and long term. – Gary Weissman Apr 24 '18 at 15:38
  • I’m voting to close this question because this question is over 3 years old with no answers, the author has been given clear alternatives in the comments, and is clearly no longer interested in an answer – jamesc Jul 22 '20 at 15:03
  • I have observed inconsistent results from R using Apple Accelerate BLAS, and also using `parallel` at the same time as some older versions of OpenBLAS. Updating OpenBLAS fixed this issue for me. It's plausible that `tm` uses some BLAS functions, so this is a possible cause here as well. – webb Jun 12 '21 at 13:27
  • @webb it's the second-highest voted *unanswered* question. There are plenty of answered questions voted higher – camille Aug 22 '21 at 15:13
  • I’m voting to close this question because this question is over 3 years old with no answers and the author has been given clear alternatives in the comments and is clearly no longer interested in an answer. Also, it's the second-most upvoted unanswered R question, so it gets more attention than it should. – webb Aug 25 '21 at 20:10

1 Answer


If your program is running up against memory limits while using parallel processing, you should first verify that the parallelism is worth it.

For instance, check whether your disk is at 80%-100% of its write speed; if that is the case, your program might as well use a single core, because it is bottlenecked by disk write speed anyway.

If this is not the case, I recommend using the debugger or adding console/GUI output to your program to verify that everything gets executed in the right order.
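
For example, since output printed inside a %dopar% loop may not show up reliably in the main console, one way to add such output is to have each worker write a small log file per chunk (a sketch based on the loop in the question; logdir is an assumed writable directory):

  logdir <- tempdir()
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    piece <- tm_map(piece, removeNumbers)
    # Record what this worker actually produced for its chunk
    cat(sprintf("chunk %d: first doc after removeNumbers = '%s'\n",
                i, piece[[1]]$content[1]),
        file = file.path(logdir, paste0("chunk_", i, ".log")))
    saveRDS(piece, paste0(tmpfile, i, ".rds"))
    return(1)
  }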

If this does not help, then I recommend verifying that you did not make a mistake in the program (for example, an assignment arrow pointing in the wrong direction).
