I have a large corpus with over 10M documents. Whenever I try to run a transformation over multiple cores using the mc.cores argument, I get an error:
Error in FUN(content(x), ...) : unused argument (mc.cores = 10)
I have 15 cores available in my hosted RStudio session.
# I have a corpus
> inspect(corpus[1])
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 46
> length(corpus)
[1] 10255313
Here is what happens when I try to apply a transformation with tm_map. These are the packages I have loaded:
library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(SnowballC)
For example:
> corpus <- tm_map(corpus, content_transformer(replace_abbreviation), mc.cores = 10)
Error in FUN(content(x), ...) : unused argument (mc.cores = 10)
I also tried adding lazy = TRUE:
corpus <- tm_map(corpus, content_transformer(replace_abbreviation), mc.cores = 10, lazy = TRUE) # I read the documentation but still don't really understand what lazy does
After that transformation, if I access a document, e.g.
> corpus[[1]][1]
I get:
Error in FUN(content(x), ...) : unused argument (mc.cores = 10)
Whereas before I would get:
> corpus.beforetransformation[[1]][1]
$content
[1] "here is some text"
What am I doing wrong here? How can I use the mc.cores argument to make use of more of my cores?
Reproducible example:
sometext <- c("cats dogs rabbits", "oranges bananas pears", "summer fall winter") %>%
  data.frame(stringsAsFactors = FALSE) %>% DataframeSource %>% VCorpus
corpus.example <- tm_map(sometext, content_transformer(replace_abbreviation), mc.cores = 2, lazy = TRUE)
corpus.example[[1]][1]
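To make clear what I am after, here is a minimal sketch of the same kind of per-document work done outside tm, calling parallel::mclapply directly on a plain character vector. It is only an illustration: transform_doc is a hypothetical stand-in for replace_abbreviation, the core count of 2 is an assumption, and this is not how tm_map works internally.

```r
library(parallel)

# Same toy documents as above, kept as a plain character vector
docs <- c("cats dogs rabbits", "oranges bananas pears", "summer fall winter")

# Hypothetical stand-in for replace_abbreviation(); any function of one string works
transform_doc <- function(x) toupper(x)

# mclapply forks processes, which is not supported on Windows,
# so fall back to a single core there (assumed core count: 2)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply the transformation to each document across the requested cores
result <- mclapply(docs, transform_doc, mc.cores = n_cores)

# mclapply returns a list; flatten it back to a character vector
unlist(result)
```

Here mc.cores is consumed by mclapply itself instead of being passed on to the transformation function, which is the behaviour I expected from tm_map.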