
I am using the R tm and RWeka packages to do some text mining. Instead of building a term-document matrix on single words, which is not enough for my purposes, I have to extract n-grams. I used @Ben's function

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

to extract trigrams. The output has an apparent error, see below. It picks up 4-, 3- and 2-word phrases. Ideally, it should have picked up ONLY the 4-word noun phrase and dropped the rest (the 3- and 2-word phrases). How do I force this solution, the way Python's NLTK has a backoff tokenizer option?

abstract strategy -> incorrect
abstract strategy board -> incorrect
abstract strategy board game -> this should be the correct output

accenture executive
accenture executive simple
accenture executive simple comment

Many thanks.

  • So, to summarize, you want to do 2-gram and 3-gram, right? – Hack-R Jun 10 '16 at 14:59
  • Say: try a 4-gram first, then a 3-word window, then a 2-word window, failing that, a single word. But report only the largest relevant item (don't repeat the 4-, 3- and 2-word phrases) – Pradeep Jun 10 '16 at 15:02

1 Answer


I think you were very close with the attempt that you made. The catch is that you told Weka to capture 2-gram and 3-gram tokens; that's exactly how Weka_control was specified (min = 2, max = 3), so both sizes appear in the output.

Instead, I'd recommend using the different token sizes in separate tokenizers and selecting or merging the results according to your preference or decision rule.

I think it would be worth checking out this great tutorial on n-gram wordclouds.

A solid code snippet for n-gram text mining is:

# QuadgramTokenizer ####
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))

for 4-grams,

# TrigramTokenizer ####
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

for 3-grams, and of course

# BigramTokenizer ####
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

for 2-grams.

You might be able to avoid your earlier problem by running the different gram sizes separately like this instead of setting Weka_control to a range.

You can apply the tokenizer like this:

tdm.ng <- TermDocumentMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
dtm.ng <- DocumentTermMatrix(ds5.1g, control = list(tokenize = BigramTokenizer))
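Once you have the separate n-gram lists, the "report only the largest item" decision rule from the comments can be sketched in base R. The helper below, keep_longest, is my own illustration (not part of tm or RWeka): it keeps a phrase only if no longer extracted phrase contains it.

```r
# Example phrases as they might come out of the separate tokenizers
ngrams <- c("abstract strategy board game",  # from QuadgramTokenizer
            "abstract strategy board",       # from TrigramTokenizer
            "abstract strategy",             # from BigramTokenizer
            "board game")

# Keep each phrase only if it is not a substring of a longer phrase
keep_longest <- function(x) {
  x[!sapply(seq_along(x), function(i)
    any(nchar(x) > nchar(x[i]) & grepl(x[i], x, fixed = TRUE)))]
}

keep_longest(ngrams)
# [1] "abstract strategy board game"
```

You would apply this to the row names (Terms) of the combined matrices before building your final term-document matrix.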

If you still have problems, please provide a reproducible example and I'll follow up.

  • Thanks Hack-R. I ran this snippet. The problem has moved elsewhere: instead of getting the term-document matrix to reflect 4-, 3-, 2-gram phrases, I am still getting single words. Example: 'deutsche' is valid, but 'deutsche bank' (which I want recognised as a valid bigram) is not recognised. Where is the issue in my understanding? – Pradeep Jun 10 '16 at 15:32
  • @Pradeep I see. In order for me to be able to help I need a reproducible example, so that I can see the problem on my computer and try some things to see what fixes it. Can you use a built-in or public data set and then paste the full code to reproduce this problem? – Hack-R Jun 10 '16 at 15:37
  • I will post a reproducible data set. Meanwhile, I checked out [@Ben](http://stackoverflow.com/questions/16836022/findassocs-for-multiple-terms-in-r). I think I can define my problem more precisely: instead of the NGram route, I just want term frequencies of identified/specified (4-, 3-, 2-word) phrases, the vectors being stored as columns in a term-document matrix. Like Deloitte Haskin Sells, Price Waterhouse, Lexis Nexis and so on. – Pradeep Jun 10 '16 at 15:51
  • Both `tm` and `RWeka::NGramTokenizer` are terrible (RAM consumption and implementation). Check `text2vec` as a much more efficient alternative. – Dmitriy Selivanov Jun 20 '16 at 13:35
  • @Hack-R: Reverting to the main question: as suggested above, I create a list of (bigrams, trigrams and 4-grams). The problem is to superset this list such that I (a) discard a bigram if it exists in the trigram, AND (b) add the bigram count to the trigram count. In the above 4-gram example, here are the counts: abstract strategy 8, abstract strategy board 3, abstract strategy board game 2. I should be able to chain the above as "abstract strategy board game" (the largest n-gram, because items 1 & 2 are already included) with count 8+3+2=13. Any help will be useful. – Pradeep Jul 29 '16 at 14:53
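The count-chaining described in the last comment can be sketched in base R (an illustration of the idea only, using the counts from that comment): roll each shorter n-gram's count up into the longest already-kept n-gram that contains it.

```r
# Counts from the comment's example
counts <- c("abstract strategy"            = 8,
            "abstract strategy board"      = 3,
            "abstract strategy board game" = 2)

# Process phrases longest-first so each shorter phrase finds its longest superset
phrases <- names(counts)[order(-nchar(names(counts)))]

rolled <- c()
for (p in phrases) {
  # Look for an already-kept longer phrase that contains this one
  host <- phrases[phrases %in% names(rolled) & grepl(p, phrases, fixed = TRUE)]
  if (length(host) > 0) {
    rolled[host[1]] <- rolled[host[1]] + counts[p]   # fold count into superset
  } else {
    rolled[p] <- counts[p]                           # keep as a new top-level phrase
  }
}
rolled
# abstract strategy board game
#                           13
```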