
I'm working with text data and have a function that runs some standard transformations. When I test this function on samples of 10k, 100k and even 1M documents, it returns the desired object: a corpus with processed text data. However, when I run it on the full data (several million documents), the returned object is NULL.

I can show and describe the data, but given the nature of the problem I don't know how to create a reproducible example. The function takes a corpus and returns a corpus; I can share a sample of the data if that would help.

I realize this is vague, but I've been trying to get this to run for days now. It's frustrating because everything works as expected when I debug by stepping through the function manually, line by line. It also works as expected when I run on a sample of the full data; I've tried samples of up to 1M records.

Some meta information, in case it has any value: I use a hosted RStudio instance, and when I run and debug there everything appears to work fine. To run the script on the full data, I ssh into the server, call the script within a screen session, and leave it running for a few hours.

I tried saving the output of the function to an RDS file, but the corpus returned from the function is just NULL.

Here is the relevant code block and culprit function:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)

# custom function for updating misspelt words using a lookup table
# (it works, have tested; everything is fine with this one)
stringi_spelling_update <- content_transformer(
  function(x, lut = spellingdoc)
    stri_replace_all_regex(str = x,
                           pattern = paste0("\\b", lut[, 1], "\\b"),
                           replacement = lut[, 2],
                           vectorize_all = FALSE))
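# For reference, spellingdoc (used above and below) is assumed to be a
# two-column lookup table: column 1 = misspelling pattern, column 2 = replacement.
# A hypothetical example of its shape:
# spellingdoc <- data.frame(misspelt = c("teh", "recieve"),
#                           correct  = c("the", "receive"),
#                           stringsAsFactors = FALSE)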

# Now the suspect function:
# corpus parameter is a corpus of over 10m documents
# n parameter is for breaking corpus up into pieces to do transformations on using parallel processing
clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece for parallel processing

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))

  # save memory
  rm(corpus)

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_along(pieces)) {
    saveRDS(pieces[[i]], paste0(tmpfile, i, ".rds"))
  }

  # parallel processing using the doParallel package
  registerDoParallel(cores = 14)
  pieces <- foreach(i = seq_along(pieces)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # spelling update based on lut
    piece <- tm_map(piece, function(i) stringi_spelling_update(i, spellingdoc))
    # regular transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    # the value of the last expression is what foreach collects for this piece
    tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"),
                         separate = FALSE, strip = TRUE,
                         char.keep = c("-", ":", "/"))))
  }

  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  rm(pieces)

  return(corpus)
} # end clean_corpus function
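
To narrow this down, here is a variant of the parallel step I plan to try: it uses foreach's .errorhandling = "pass" so a worker error comes back as a condition object in the result list instead of killing the whole job, and it checkpoints each cleaned piece to disk before the combine. Just a sketch reusing the objects from clean_corpus above (the "_clean" file suffix is a name I made up):

registerDoParallel(cores = 14)
results <- foreach(i = seq_along(pieces), .errorhandling = "pass") %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  piece <- tm_map(piece, function(d) stringi_spelling_update(d, spellingdoc))
  piece <- tm_map(piece, content_transformer(replace_abbreviation))
  # checkpoint each cleaned piece so a late failure doesn't lose hours of work
  saveRDS(piece, paste0(tmpfile, i, "_clean.rds"))
  piece
}

# before combining, check whether any worker returned an error or NULL
bad <- vapply(results,
              function(r) inherits(r, "condition") || is.null(r),
              logical(1))
which(bad)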

I don't know if I'm looking in the wrong place. If the function works fine on smaller pieces, maybe something else is going on?

How can it be that this code works for "small" data but when I try to run on my full data I get back NULL?
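
One thing I noticed while staring at the combine step: if every piece came back from the parallel loop as NULL, the combine itself would faithfully return NULL, since concatenating NULLs yields NULL:

pieces <- list(NULL, NULL, NULL)
do.call(function(...) c(..., recursive = TRUE), pieces)
# NULL

So a NULL result could mean the pieces themselves are NULL rather than the combine step misbehaving.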

Also, here is the output of sessionInfo():

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
 [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doParallel_1.0.10      iterators_1.0.8        foreach_1.4.3          stringi_1.1.5         
 [5] textstem_0.0.1         tm_0.7-1               NLP_0.1-10             stringr_1.2.0         
 [9] qdap_2.2.5             RColorBrewer_1.1-2     qdapTools_1.3.3        qdapRegex_0.7.2       
[13] qdapDictionaries_1.0.6 dplyr_0.7.1            purrr_0.2.2.2          readr_1.1.1           
[17] tidyr_0.6.3            tibble_1.3.1           ggplot2_2.2.1          tidyverse_1.1.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11        lubridate_1.6.0     lattice_0.20-35     xlsxjars_0.6.1     
 [5] gtools_3.5.0        assertthat_0.2.0    psych_1.7.5         slam_0.1-40        
 [9] R6_2.2.1            cellranger_1.1.0    plyr_1.8.4          chron_2.3-50       
[13] httr_1.2.1          rlang_0.1.1         lazyeval_0.2.0      readxl_1.0.0       
[17] data.table_1.10.4   gdata_2.18.0        gender_0.5.1        foreign_0.8-67     
[21] igraph_1.0.1        RCurl_1.95-4.8      munsell_0.4.3       broom_0.4.2        
[25] compiler_3.4.0      modelr_0.1.0        pkgconfig_2.0.1     mnormt_1.5-5       
[29] reports_0.1.4       gridExtra_2.2.1     codetools_0.2-15    XML_3.98-1.9       
[33] bitops_1.0-6        openNLP_0.2-6       grid_3.4.0          nlme_3.1-131       
[37] jsonlite_1.4        gtable_0.2.0        magrittr_1.5        scales_0.4.1       
[41] xlsx_0.5.7          reshape2_1.4.2      bindrcpp_0.2        openNLPdata_1.5.3-2
[45] xml2_1.1.1          venneuler_1.1-0     wordcloud_2.5       tools_3.4.0        
[49] forcats_0.2.0       glue_1.1.1          hms_0.3             plotrix_3.6-5      
[53] colorspace_1.3-2    rvest_0.3.2         rJava_0.9-8         bindr_0.1          
[57] haven_1.1.0        
  • 1) Could you set `mc.cores = 14` in the tm_map function and then not split the corpus? 2) Do all of your chunks of data return a valid corpus? 3) `removeNumbers` does not need to be wrapped in `content_transformer` (this is just an FYI). 4) You save the pieces to disk but do not remove pieces from the environment so I do not think you are actually saving yourself any memory. – emilliman5 Aug 21 '17 at 16:27
  • Thank you very much for these suggestions, I will act on them now. Regarding number 1. I can set mc.cores within tm_map? So I don't even need doparallel package? – Doug Fir Aug 21 '17 at 16:31
  • Correct, tm_map was designed to be run in parallel since it is just applying the same function to a bunch of individual elements. – emilliman5 Aug 21 '17 at 16:43
  • I get ```Error in FUN(X[[i]], ...) : unused argument (mc.cores = 10)``` from ```piece <- tm_map(corpus, content_transformer(function(x, ...)``` – Doug Fir Aug 21 '17 at 16:50
  • With:
    ```
    clean_corpus <- function(corpus) {
      # spelling update based on lut
      corpus <- tm_map(corpus, function(i) stringi_spelling_update(i, spellingdoc), mc.cores = 10)
      # regular transformations
      piece <- tm_map(corpus, content_transformer(replace_abbreviation), mc.cores = 10)
      piece <- tm_map(corpus, removeNumbers, mc.cores = 10)
      piece <- tm_map(corpus, content_transformer(function(x, ...)
        qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = F, strip = T,
                           char.keep = c("-", ":", "/"))), mc.cores = 10)
      return(corpus)
    } # end clean_corpus function
    ```
    – Doug Fir Aug 21 '17 at 16:51
  • It looks like you might now need to add `lazy=TRUE` to the `tm_map` call. I do not have big enough corpus to test on right now. These posts may also help: https://stackoverflow.com/questions/18287981/tm-map-has-parallelmclapply-error-in-r-3-0-1-on-mac?noredirect=1&lq=1, https://stackoverflow.com/questions/25069798/r-tm-in-mclapplycontentx-fun-all-scheduled-cores-encountered-errors – emilliman5 Aug 21 '17 at 18:19
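
Update, following emilliman5's suggestion: in tm 0.7 tm_map() no longer takes mc.cores directly (hence the unused argument error above); transformations are routed through tm_parLapply(), and as far as I understand a parallel backend can be registered with tm_parLapply_engine(). A minimal sketch of what I think that looks like (fork-based, so Linux only; the core count is arbitrary):

library(tm)
library(parallel)

# register mclapply as tm's parallelization engine (tm >= 0.7)
tm_parLapply_engine(function(X, FUN, ...) mclapply(X, FUN, ..., mc.cores = 14))

# subsequent tm_map() calls should then be distributed over 14 cores
corpus <- tm_map(corpus, content_transformer(removeNumbers))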
