
I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R

The idea goes something like this:

library(tm)
library(RWeka)
data(crude)

# Tokenizer for bigrams, passed to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However, the final line gives me the error:

Error in rep(seq_along(x), sapply(tflist, length)) : 
  invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

If I remove the tokenizer from the last line it creates a regular TDM, so I guess the problem lies somewhere in the BigramTokenizer function, although this is the same example that the tm FAQ gives here: http://tm.r-forge.r-project.org/faq.html#Bigrams.
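
For comparison, this is the call that works when I drop the tokenizer (continuing from the snippet above), so the corpus itself seems fine:

# Default single-token TDM builds without any error
txtTdm <- TermDocumentMatrix(crude)
inspect(txtTdm[1:5, 1:5])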

ds10
  • It works fine for me; I can't reproduce your error message. You might try updating your packages and R to make sure you're using the latest version of everything. – Ben Jul 17 '13 at 20:34
  • Thanks again for your advice. I still get the error message after checking my version of R and running update.packages. I wonder if this is an OS problem, as I often run into Java problems on OS X, so maybe it's affecting Weka? Will try on my Windows machine. I'll give this a try too: http://stackoverflow.com/questions/8898521/finding-2-3-word-phrases-using-r-tm-package – ds10 Jul 18 '13 at 10:14
  • Yes, the next step is making sure your Java installation is all in order (and this can be quite a frustrating task!). I don't use OSX, maybe it's not so bad, but Windows doesn't make it easy... – Ben Jul 18 '13 at 19:54
  • I had a look at my Java installation. I couldn't see anything out of the ordinary. Now I don't receive the error message but my Mac hangs when I try to run the code. Historically I have had problems with OS X and various bits of kit built in Java. The code snippet does however work perfectly on my Windows box. – ds10 Jul 19 '13 at 08:32
  • Seeing the same problem. Turned debug on and narrowed down this line. Works fine with default scan_tokenizer but returns NULLs even with NGramTokenizer `parallel::mclapply(corpus, FUN=termFreq, control = list(tokenize = scan_tokenizer))` – Anthony Aug 24 '13 at 01:21

2 Answers


Inspired by Anthony's comment, I found out that you can specify the number of threads that the parallel library uses by default (set it before you call NGramTokenizer):

# Sets the default number of threads to use
options(mc.cores=1)

Since NGramTokenizer seems to hang inside the parallel::mclapply call, restricting it to a single core works around the problem.
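
Putting it together with the snippet from the question, a minimal sketch of the full workaround looks like this (the findFreqTerms call at the end is just an illustrative check, not part of the fix):

library(tm)
library(RWeka)
data(crude)

# Restrict parallel's mclapply to a single core before tokenizing
options(mc.cores = 1)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

# Illustrative check: list bigrams that occur at least 10 times
findFreqTerms(txtTdmBi, 10)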

brian.keng

It seems there are problems using RWeka with the parallel package. I found a workaround solution here.

The most important point is not to load the RWeka package, but to call it through its namespace inside an encapsulated function.

So your tokenizer should look like this:

BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
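
A minimal sketch of how this would be used with the crude corpus from the question (RWeka still has to be installed, but library(RWeka) is never called):

library(tm)  # note: no library(RWeka)
data(crude)

# The tokenizer reaches RWeka only through its namespace
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}

txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
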
Dmitriy Selivanov
  • Is there any alternative to NGramTokenizer? On my computer RWeka is not working due to some R/Java version issues. – harsha Apr 14 '17 at 11:20