RWeka NgramTokenizer

Question

I've been struggling with the RWeka package, specifically with using the NGramTokenizer function to make bigrams. Searching online, I've found one or two other users with the same issue, but no solution that works for me.

One example is this thread: "2-gram and 3-gram instead of 1-gram using RWeka".

So running:

library(RWeka) 
library(tm)

as.matrix(TermDocumentMatrix(Corpus(VectorSource(c(txt1 = "This is my house",
                                               txt2 = "My house is green"))),
                         list(tokenize = function(x) NGramTokenizer(x, 
                                                                    Weka_control(min=2, 
                                                                                 max=2)),
                              tolower = TRUE)))

I get:

       Docs
Terms   txt1 txt2
  house    1    1
  this     1    0
  green    0    1

Note that the result contains no bigrams, only unigrams (house, this, green).

I've also tried it on a volatile corpus with the tokenizer split out into its own function, as I learnt in a DataCamp course, but then I get the error below instead.

Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, : java.lang.NullPointerException Called from: .jcheck()

Other workarounds I found online ran without errors, but still produced only unigrams, as above.

I'm running Java 1.8 and R 3.4.3, both 64-bit, on 64-bit Windows.

I tried installing older versions of RWeka, but when I also tried an older version of tm it produced errors, so I couldn't make that work (I used the versions referenced by LukeA in the SO thread linked at the start of this question).

score 1 · Accepted Answer · answered Feb 08 '18 at 07:45

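You need to use a VCorpus instead of a Corpus in order to use the NGramTokenizer. As far as I know, in recent versions of tm (0.7 and later) Corpus() on a VectorSource creates a SimpleCorpus, and TermDocumentMatrix() does not apply a custom tokenize function to that class, which is why you only get unigrams.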

So if you change your code to:

as.matrix(TermDocumentMatrix(VCorpus(VectorSource(c(txt1 = "This is my house",
                                                    txt2 = "My house is green"))),
                             list(tokenize = function(x) NGramTokenizer(x, 
                                                                        Weka_control(min=2, 
                                                                                     max=2)),
                                  tolower = TRUE)))

It will return:

          Docs
Terms      1 2
  house is 0 1
  is green 0 1
  is my    1 0
  my house 1 1
  this is  1 0

I totally thought I did the VCorpus on this approach - but apparently I didn't. Thank you! I'm very happy that I don't need to downgrade anything. — Shane, Feb 09 '18 at 00:09

Shane · Answer 2 · 2018-02-10T04:24:02.767

This problem had two parts to it, and I probably should have articulated it better.

1) The VCorpus issue, as addressed by @clemens: using just the Corpus function will leave you with unigrams.

2) However, after fixing that and applying the approach to my larger data set, I got the error below:

Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, : java.lang.NullPointerException Called from: .jcheck()

I thought this was due to an incompatibility between RWeka, Java or package versions. However, since the example from step 1 worked fine, I concluded the problem had to be in my dataset. On investigating and testing, I found that it contained one-word answers and blanks. After cleaning out both of these, I stopped getting the error message. Note that I still had to do this even when my Weka_control settings were min=1, max=2.
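For anyone hitting the same error, the cleaning step can be sketched roughly as below. This is only a minimal illustration in base R, not my exact code: the vector `answers` and its contents are made up for the example, and the two-word threshold is an assumption based on what I described above.

```r
# Hypothetical example data; `answers` stands in for the real text column.
answers <- c("This is my house", "", "Yes", NA, "   ", "My house is green")

# Trim surrounding whitespace so blank-looking entries become "".
trimmed <- trimws(answers)

# Count whitespace-separated words; NA and blank entries count as 0 words.
word_count <- ifelse(is.na(trimmed) | trimmed == "",
                     0L,
                     lengths(strsplit(trimmed, "\\s+")))

# Keep only entries with at least two words (drops blanks, NAs and
# one-word answers) before building the VCorpus.
cleaned <- answers[word_count >= 2]
```

The `cleaned` vector can then be passed to VectorSource() and VCorpus() in the same way as in the accepted answer.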
