
Something seems to have gone wrong in the latest tm upgrade. My code, with test data, is below:

library(tm)

data = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit',
         'Vestibulum posuere nisl vel lobortis vulputate',
         'Quisque eget sem in felis egestas sagittis')
ccorpus_clean = Corpus(VectorSource(data))
# clean-up transformations
ccorpus_clean = tm_map(ccorpus_clean, removePunctuation, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, stripWhitespace, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, tolower, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeNumbers, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, stemDocument, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeWords, stopwords("english"), lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeWords, c("hi"), lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeWords, c("account", "can"), lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, PlainTextDocument, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, stripWhitespace, lazy = TRUE)
ccorpus_clean
# pull the transformed text back out into a data frame
df = data.frame(text = unlist(sapply(ccorpus_clean, `[[`, "content")), stringsAsFactors = FALSE)

Everything was working fine earlier, but suddenly I needed to add `lazy = TRUE`; without it, the corpus transformations stopped working. The lazy problem is documented here: R tm In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code
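For background, a common trigger for that mclapply error in tm 0.6+ is mapping a plain base function such as tolower directly over the corpus; wrapping such functions in content_transformer() is the usual remedy. A minimal sketch of that variant (shown for context, untested here):

# sketch: wrap non-tm functions in content_transformer() so each
# document stays a PlainTextDocument instead of being coerced to
# a bare character vector
ccorpus_clean = Corpus(VectorSource(data))
ccorpus_clean = tm_map(ccorpus_clean, content_transformer(tolower))
ccorpus_clean = tm_map(ccorpus_clean, removePunctuation)  # tm's own transformations need no wrapper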

With lazy = TRUE the transformations work, but converting the corpus back to a data frame now fails with the error below:

ccorpus_clean = tm_map(ccorpus_clean, stripWhitespace, lazy = TRUE)
ccorpus_clean

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5

df = data.frame(text = unlist(sapply(ccorpus_clean, `[[`, "content")), stringsAsFactors = FALSE)

Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning message:
In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
all scheduled cores encountered errors in user code

Edit: This also fails -

data.frame(text = sapply(ccorpus_clean, as.character), stringsAsFactors = FALSE)

Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "try-error"

R version: 3.2.3 (2015-12-10) / tm: 0.6-2

myloginid
  • does it help when you put `options(mc.cores=1)` in your .Rprofile file (and restart R) ? – knb May 23 '16 at 12:08

2 Answers


Looks very complicated. How about:

data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
          "Some English words are just made for stemming!")

require(quanteda)

# makes the texts into a list of tokens with the same treatment
# as your tm mapped functions
toks <- tokenize(toLower(data), removePunct = TRUE, removeNumbers = TRUE)
# toks is just a named list
toks
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "lorem"    "ipsum"    "dolor"    "sit"      "amet"     "account"  "red"      "balloons"
## 
## Component 2 :
## [1] "some"     "english"  "words"    "are"      "just"     "made"     "for"      "stemming"

# remove selected terms
toks <- removeFeatures(toks, c(stopwords("english"), "hi", "account", "can"))

# apply stemming
toks <- wordstem(toks)

# make into a data frame by reassembling the cleaned tokens
(df <- data.frame(text = sapply(toks, paste, collapse = " ")))
##                                     text
## 1 lorem ipsum dolor sit amet red balloon
## 2            english word just made stem
Ken Benoit
  • Thanks Ken. Let me try this and get back. I hope the functionality does not change much from tm to quanteda. We have historical data that we definitely don't want to rerun. – myloginid May 23 '16 at 05:07
  • Same functionality but more straightforward API. And you can create a **quanteda** corpus directly from a **tm** corpus using `quanteda::corpus(yourTmCorpus)`. I'm happy to help with that. – Ken Benoit May 23 '16 at 05:34
  • Hi Ken.. Is there a way to remove sparse terms too? I don't see any in the documentation other than "dfmSparse-class". We were doing something like this earlier: `TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3)); dtmmm <- DocumentTermMatrix(ccorpus_clean, control = list(tokenize = TrigramTokenizer)); dtmmm <- removeSparseTerms(dtmmm, 0.995)` – myloginid May 23 '16 at 10:13
  • Yes, it's `trim()`. See `?trim` and the method for dfm objects. – Ken Benoit May 23 '16 at 10:52
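Following up on the `trim()` pointer in the comments, here is a minimal sketch of the sparse-term workflow, assuming the quanteda API of that era (`dfm()` plus `trim()`; the `ngrams` and `sparsity` argument names are assumptions and may differ across versions):

# build a document-feature matrix from the cleaned tokens, using
# 2- and 3-grams as in the original Weka_control(min = 2, max = 3)
mydfm <- dfm(toks, ngrams = 2:3)

# trim() plays the role of tm's removeSparseTerms(); sparsity is
# assumed here to mirror the 0.995 threshold in the original code
mydfm <- trim(mydfm, sparsity = 0.995)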

I had a similar problem, and it does not seem to be caused by upgrading the tm package. If you do not want to use quanteda, another solution is to set the number of cores to 1 (instead of using lazy = TRUE). I am not sure why, but this worked for me.

corpus = tm_map(corpus, tolower, mc.cores = 1)

If you want to diagnose whether this problem is caused by parallel computing issues, try typing this line:

getOption("mc.cores", 2L)

If it returns 2 cores, then setting the number of cores to 1 should solve the problem. See this answer for a detailed explanation.
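A minimal sketch of the check-and-fix, assuming the two-core default is the culprit (as noted in the comment under the question, `options(mc.cores = 1)` can also go in your .Rprofile so it applies to every session):

# see how many cores tm's internal mclapply() will use (2L is the fallback)
getOption("mc.cores", 2L)

# force serial execution for the whole session, then re-run the mapping
options(mc.cores = 1)
corpus = tm_map(corpus, tolower)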

RachelSunny