
I have a large corpus with over 10M documents. Whenever I try to run a transformation across multiple cores using the mc.cores argument, I get this error:

Error in FUN(content(x), ...) : unused argument (mc.cores = 10)

I have 15 cores available in my current hosted RStudio session.

# I have a corpus
> inspect(corpus[1])
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 46

> length(corpus)
[1] 10255313

Watch what happens when I try to make transformations using tm_map

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(SnowballC)

E.g.

> corpus <- tm_map(corpus, content_transformer(replace_abbreviation), mc.cores = 10)
Error in FUN(content(x), ...) : unused argument (mc.cores = 10)

I also tried adding lazy = T:

corpus <- tm_map(corpus, content_transformer(replace_abbreviation), mc.cores = 10, lazy = T) # read the documentation, still don't really get what this does

After the transformation, if I run e.g.

> corpus[[1]][1]

I get:

Error in FUN(content(x), ...) : unused argument (mc.cores = 10)

Whereas before I would get:

> corpus.beforetransformation[[1]][1]
$content
[1] "here is some text"

What am I doing wrong here? How can I use the mc.cores argument to make use of more of my processors?

Reproducible example:

sometext <- c("cats dogs rabbits", "oranges banannas pears", "summer fall winter") %>% 
  data.frame(stringsAsFactors = F) %>% DataframeSource %>% VCorpus

corpus.example <- tm_map(sometext, content_transformer(replace_abbreviation), mc.cores = 2, lazy = T)
corpus.example[[1]][1]
Doug Fir
  • For one, extra arguments passed to `tm_map` via `...` are passed on to `FUN`. So your `mc.cores` argument is being passed to the function returned by `content_transformer(replace_abbreviation)`. I *think* you may need to register a cluster using the parallel package, and then use the `tm_parLapply_engine` function to tell the tm package to use that cluster, but that is somewhat speculative. – Taylor H Aug 22 '17 at 15:08
  • Tried moving the mc.cores argument to content_transformer but same error. RE registering a cluster... strikes me as odd? I initially started this task by creating clusters then via another SO post was told to just use mc.cores arg instead of doing that – Doug Fir Aug 22 '17 at 15:14
  • See page 14 of the tm package documentation for more info. https://cran.r-project.org/web/packages/tm/tm.pdf – Taylor H Aug 22 '17 at 15:17
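
To illustrate the first comment's point about `...`, here is a minimal base-R sketch. `apply_to_docs` and `shout` are hypothetical stand-ins (not tm functions) for `tm_map` and the transformer; the sketch only assumes that extra arguments are forwarded to `FUN` unchanged, which is what the comment describes.

```r
# Hypothetical stand-in for tm_map: forwards any extra arguments to FUN.
apply_to_docs <- function(x, FUN, ...) lapply(x, FUN, ...)

# Hypothetical transformer that accepts only the text itself.
shout <- function(text) toupper(text)

# Works: no extra arguments are forwarded.
ok <- apply_to_docs(list("cats dogs"), shout)

# Fails: mc.cores is handed to shout(), which has no such argument.
msg <- tryCatch(
  apply_to_docs(list("cats dogs"), shout, mc.cores = 2),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message should be the same kind of "unused argument" error as in the question, which suggests `mc.cores` never reaches any parallel backend when passed this way.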

1 Answer


From the tm documentation, try the following:

options(mc.cores = 10)  # or whatever
tm_parLapply_engine(parallel::mclapply)  # mclapply gets the number of cores from global options
tm_map(sometext, content_transformer(replace_abbreviation))
Taylor H
  • Running it just now and I see all the processors lighting up in the shell. I'm pretty excited right now! Let's see if the outcome is as expected, give it a few minutes. What exactly is the second line doing then? – Doug Fir Aug 22 '17 at 15:30
  • `tm_parLapply_engine` sets the method tm uses for parallelization. If you pass `NULL` to it, it will just use `lapply` (no parallelism). – Taylor H Aug 22 '17 at 15:32
  • @DougFir If this answered your question, please upvote the answer and/or accept it. – G5W Aug 22 '17 at 22:01
  • @G5W once it's finished running and confirmed I will, I'm just running on a large corpus – Doug Fir Aug 23 '17 at 01:04
  • Thanks for your help @TaylorH, I had a hard time understanding the tm documentation but this got me what I need – Doug Fir Aug 23 '17 at 05:06
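
As a rough sketch of the engine mechanism discussed in the comments above: `tm_parLapply_engine` also accepts a function with formals `X`, `FUN` and `...`, so the core count can be passed to `mclapply` explicitly instead of through a global option. The toy corpus, `toupper` as the transformation, and the `n_workers` value are assumptions for illustration only. Note that `mclapply` relies on forking, so `mc.cores` greater than 1 is not supported on Windows.

```r
library(tm)
library(parallel)

# Small stand-in corpus (illustrative data, not the 10M-document corpus).
docs <- c("cats dogs rabbits", "oranges bananas pears", "summer fall winter")
corpus <- VCorpus(VectorSource(docs))

# Assumed worker count; adjust to the machine.
n_workers <- 2L

# Register a function engine; tm_map on a VCorpus dispatches through it.
tm_parLapply_engine(function(X, FUN, ...) {
  mclapply(X, FUN, ..., mc.cores = n_workers)
})

upper <- tm_map(corpus, content_transformer(toupper))

# Restore the default engine (plain lapply, no parallelism).
tm_parLapply_engine(NULL)

content(upper[[1]])
```

Resetting the engine with `NULL` afterwards keeps later `tm_map` calls in the same session from silently forking workers.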