
I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R

The idea goes something like this:

library(tm)
library(RWeka)
data(crude)

# Tokenizer for bigrams, passed to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However, the final line gives me the error:

Error in rep(seq_along(x), sapply(tflist, length)) : 
  invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

If I remove the tokenizer from the last line it creates a regular TDM, so I guess the problem lies somewhere in the BigramTokenizer function, although this is the same example that the tm FAQ gives here: http://tm.r-forge.r-project.org/faq.html#Bigrams.
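
For comparison, this is the call that works when I drop the tokenizer (continuing from the snippet above), so the corpus itself seems fine:

# Default single-token TDM builds without any error
txtTdm <- TermDocumentMatrix(crude)
inspect(txtTdm[1:5, 1:5])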

ds10
  • It works fine for me; I can't reproduce your error message. You might try updating your packages and R to make sure you're using the latest version of everything. – Ben Jul 17 '13 at 20:34
  • Thanks again for your advice. I still get the error message after checking my version of R and running update.packages. I wonder if this is an OS problem, as I often run into Java problems on OS X, so maybe it's affecting Weka? Will try on my Windows machine. I'll give this a try too: http://stackoverflow.com/questions/8898521/finding-2-3-word-phrases-using-r-tm-package – ds10 Jul 18 '13 at 10:14
  • Yes, the next step is making sure your Java installation is all in order (and this can be quite a frustrating task!). I don't use OSX, maybe it's not so bad, but Windows doesn't make it easy... – Ben Jul 18 '13 at 19:54
  • I had a look at my Java installation. I couldn't see anything out of the ordinary. Now I don't receive the error message but my Mac hangs when I try to run the code. Historically I have had problems with OS X and various bits of kit built in Java. The code snippet does however work perfectly on my Windows box. – ds10 Jul 19 '13 at 08:32
  • Seeing the same problem. Turned debug on and narrowed down this line. Works fine with default scan_tokenizer but returns NULLs even with NGramTokenizer `parallel::mclapply(corpus, FUN=termFreq, control = list(tokenize = scan_tokenizer))` – Anthony Aug 24 '13 at 01:21

2 Answers


Inspired by Anthony's comment, I found out that you can specify the number of threads that the parallel library uses by default (set it before you call NGramTokenizer):

# Sets the default number of threads to use
options(mc.cores=1)

Since NGramTokenizer seems to hang inside the parallel::mclapply call, restricting it to a single core works around the problem.
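
Putting it together with the snippet from the question, a minimal sketch of the full workaround looks like this (the findFreqTerms call at the end is just an illustrative check, not part of the fix):

library(tm)
library(RWeka)
data(crude)

# Restrict parallel's mclapply to a single core before tokenizing
options(mc.cores = 1)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# Illustrative check: list bigrams that occur at least 10 times
findFreqTerms(txtTdmBi, 10)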

brian.keng

It seems there are problems using RWeka with the parallel package. I found a workaround solution here.

The most important point is not to load the RWeka package, but to call it through its namespace inside an encapsulated function.

So your tokenizer should look like this:

BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
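
A minimal sketch of how this would be used with the crude corpus from the question (RWeka still has to be installed, but library(RWeka) is never called):

library(tm)  # note: no library(RWeka)
data(crude)

# The tokenizer reaches RWeka only through its namespace
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}

txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
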
Dmitriy Selivanov
  • Is there any alternative to NGramTokenizer? On my computer RWeka is not working due to some R/Java version issues. – harsha Apr 14 '17 at 11:20