I'm working with text data and have a function that runs some standard transformations. When I test it on samples of 10k, 100k and even 1M documents, it returns the desired object: a corpus of processed text. However, when I run it on the full data (several million documents), the returned object is NULL.
I can show and describe the data but given the very nature of the problem I don't know how to create a reproducible example.
The function takes a corpus and returns a corpus. I can share a sample of data if it's deemed helpful.
I realize this is vague, but I've been trying to get this to run for days now. It's frustrating because everything works as expected when I debug by stepping through the function manually, line by line. It also works as expected on samples of the full data; I've tried up to 1M records.
Some meta information, in case it has any value: I have a hosted RStudio, and when I run and debug in there everything appears to work fine. To run the script on the full data, I ssh into the server, call the script within a screen session and leave it running for a few hours.
I tried saving the output of the function to an RDS file, but the corpus returned by the function is just NULL.
Here is the relevant code block and culprit function:
library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
# custom function for updating misspelt words using a lookup table
# (tested, it works as expected)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc)
  stri_replace_all_regex(str = x, pattern = paste0("\\b", lut[, 1], "\\b"),
                         replacement = lut[, 2], vectorize_all = FALSE))
# Now the suspect function:
# corpus parameter is a corpus of over 10m documents
# n parameter is for breaking corpus up into pieces to do transformations on using parallel processing
clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece for parallel processing
  # split the corpus into pieces to get around memory issues with the transformations
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each = n, length.out = nr))
  # save memory
  rm(corpus)
  # save pieces to rds files since there is not enough RAM
  tmpfile <- tempfile()
  for (i in seq_along(pieces)) {
    saveRDS(pieces[[i]], paste0(tmpfile, i, ".rds"))
  }
  # parallel processing using the doParallel package
  registerDoParallel(cores = 14)
  pieces <- foreach(i = seq_along(pieces)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # spelling update based on lut
    piece <- tm_map(piece, function(doc) stringi_spelling_update(doc, spellingdoc))
    # regular transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    piece <- tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = FALSE,
                         strip = TRUE, char.keep = c("-", ":", "/"))))
  }
  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  rm(pieces)
  return(corpus)
} # end clean_corpus function
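To make the chunking step concrete, here is a toy sketch of the index that clean_corpus() passes to split(). The values of nr and n are made up for illustration, and a plain integer vector stands in for the corpus, on the assumption that split() groups corpus documents the same way it groups vector elements.

```r
# Toy illustration of the chunking index used in clean_corpus().
# nr and n are hypothetical small values; the real run uses n = 500000.
nr <- 10  # number of documents
n  <- 4   # chunk size

# Each chunk id is repeated n times, and the result is cut off at nr entries.
chunk_id <- rep(1:ceiling(nr / n), each = n, length.out = nr)

# A plain vector stands in for the corpus; elements are grouped by chunk id.
pieces <- split(seq_len(nr), chunk_id)

# Number of elements in each piece: two full chunks and one partial chunk.
lengths(pieces)
```

With these toy values the last piece is shorter than n, which is also what I would expect for the final piece of the real corpus.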
I don't know if I'm looking in the wrong place. If the function works fine on smaller pieces, maybe something else is going on?
How can it be that this code works for "small" data but when I try to run on my full data I get back NULL?
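One check I could add before the combine step is to look for pieces that came back as NULL from foreach. The sketch below uses a made-up list in place of the real foreach result. Whether any worker actually returns NULL on the full run is an assumption I have not confirmed.

```r
# Toy stand-in for the list returned by foreach(); the NULL entry is a
# hypothetical example of a piece that came back empty.
results <- list("a", NULL, "c")

# Indices of pieces that are NULL.
null_idx <- which(vapply(results, is.null, logical(1)))
null_idx

# Combining a list that contains only NULLs, using the same combine call
# as in clean_corpus(), gives NULL.
all_null <- do.call(function(...) c(..., recursive = TRUE), list(NULL, NULL))
is.null(all_null)
```

If every piece were NULL on the full run, the combine call would return NULL, which would match what I am seeing.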
Also, here is the output of sessionInfo():
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 LC_ADDRESS=en_US.UTF-8
[10] LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 stringi_1.1.5
[5] textstem_0.0.1 tm_0.7-1 NLP_0.1-10 stringr_1.2.0
[9] qdap_2.2.5 RColorBrewer_1.1-2 qdapTools_1.3.3 qdapRegex_0.7.2
[13] qdapDictionaries_1.0.6 dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1
[17] tidyr_0.6.3 tibble_1.3.1 ggplot2_2.2.1 tidyverse_1.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 lubridate_1.6.0 lattice_0.20-35 xlsxjars_0.6.1
[5] gtools_3.5.0 assertthat_0.2.0 psych_1.7.5 slam_0.1-40
[9] R6_2.2.1 cellranger_1.1.0 plyr_1.8.4 chron_2.3-50
[13] httr_1.2.1 rlang_0.1.1 lazyeval_0.2.0 readxl_1.0.0
[17] data.table_1.10.4 gdata_2.18.0 gender_0.5.1 foreign_0.8-67
[21] igraph_1.0.1 RCurl_1.95-4.8 munsell_0.4.3 broom_0.4.2
[25] compiler_3.4.0 modelr_0.1.0 pkgconfig_2.0.1 mnormt_1.5-5
[29] reports_0.1.4 gridExtra_2.2.1 codetools_0.2-15 XML_3.98-1.9
[33] bitops_1.0-6 openNLP_0.2-6 grid_3.4.0 nlme_3.1-131
[37] jsonlite_1.4 gtable_0.2.0 magrittr_1.5 scales_0.4.1
[41] xlsx_0.5.7 reshape2_1.4.2 bindrcpp_0.2 openNLPdata_1.5.3-2
[45] xml2_1.1.1 venneuler_1.1-0 wordcloud_2.5 tools_3.4.0
[49] forcats_0.2.0 glue_1.1.1 hms_0.3 plotrix_3.6-5
[53] colorspace_1.3-2 rvest_0.3.2 rJava_0.9-8 bindr_0.1
[57] haven_1.1.0