I've been struggling with the RWeka package, specifically with using the NGramTokenizer function to create bigrams. After searching online, I've found one or two other users with the same issue, but no solution that works for me.
One example is this SO thread: "2-gram and 3-gram instead of 1-gram using RWeka".
So running:
    library(RWeka)
    library(tm)
    as.matrix(TermDocumentMatrix(
      Corpus(VectorSource(c(txt1 = "This is my house",
                            txt2 = "My house is green"))),
      list(tokenize = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2)),
           tolower = TRUE)))
I get:
           Docs
    Terms   txt1 txt2
      house    1    1
      this     1    0
      green    0    1
Note that there are no bigrams, only unigrams (house, this, green).
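To make the expected result explicit, here is a minimal base R sketch of the bigrams I was hoping to see as terms. The `bigrams` helper is a hypothetical name used only for illustration; it is not part of RWeka or tm, and it assumes that splitting on whitespace is good enough for these toy sentences.

```r
# Hypothetical helper (not from RWeka or tm): builds word bigrams
# by pairing each word with the word that follows it.
bigrams <- function(text) {
  # Lower-case and split on runs of whitespace (an assumption for this sketch)
  words <- strsplit(tolower(text), "\\s+")[[1]]
  # Fewer than two words means there is nothing to pair up
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams("This is my house")
# "this is" "is my" "my house"
bigrams("My house is green")
# "my house" "house is" "is green"
```

So for the two documents above, I would expect terms such as "my house" (present in both txt1 and txt2) rather than single words.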
I've also tried it on a volatile corpus with the tokenizer function split out, and following the approach I learnt from a DataCamp course, but then I get the error below instead:
    Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,  :
      java.lang.NullPointerException
    Called from: .jcheck()
Other workarounds I found online ran without errors, but still produced only unigrams, as above.
I'm running Java 1.8 and R 3.4.3, both 64-bit, on 64-bit Windows.
I also tried installing older versions of RWeka and tm (the versions referenced by LukeA in the SO thread linked at the start of this question), but the old tm install came up with errors, so I couldn't get that to work.