I am using the R tm and RWeka packages to do some text mining. Instead of building a tf-tdm on single words, which is not enough for my purposes, I have to extract n-grams. I used @Ben's function
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
to extract trigrams. The output has an apparent error (see below): it picks up 4-, 3- and 2-word phrases. Ideally, it should have picked up ONLY the 4-word noun phrase and dropped the rest (the 3- and 2-word phrases). How do I force this behavior, similar to the backup tokenizer option in Python's NLTK?
abstract strategy -> this is incorrect
abstract strategy board -> incorrect
abstract strategy board game -> this should be the correct output
accenture executive
accenture executive simple
accenture executive simple comment
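
To make the desired result concrete, here is a minimal sketch of one possible post-processing step in base R. It assumes the n-gram terms are available as a plain character vector, and it drops every term that appears as a whole-word phrase inside a longer term. The helper name keep_longest_ngrams is hypothetical (it is not part of tm or RWeka), and this is plain string filtering, not noun-phrase detection: on a full corpus, where nearly every short n-gram sits inside some longer one, it would mostly just keep the longest n-grams.

```r
# Hypothetical helper (not part of tm or RWeka): keep only the n-grams that
# are not contained, as a whole-word phrase, in a longer n-gram.
keep_longest_ngrams <- function(terms) {
  terms <- unique(terms)
  # Pad with spaces so that matching respects word boundaries
  # (e.g. "board game" is not treated as part of "board games").
  padded <- paste0(" ", terms, " ")
  is_contained <- vapply(seq_along(terms), function(i) {
    # TRUE if term i occurs literally inside any other term
    any(grepl(padded[i], padded[-i], fixed = TRUE))
  }, logical(1))
  terms[!is_contained]
}

# The terms from the output above
terms <- c("abstract strategy",
           "abstract strategy board",
           "abstract strategy board game",
           "accenture executive",
           "accenture executive simple",
           "accenture executive simple comment")

longest <- keep_longest_ngrams(terms)
print(longest)
# [1] "abstract strategy board game"       "accenture executive simple comment"
```

The resulting vector could then be used to restrict the term-document matrix to the surviving terms.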
Many thanks.