
I am trying to look for a list of keywords in a text. Some of these keywords are n-grams. However, TermDocumentMatrix only finds single words. I have already looked at several similar questions, like this one (from which I borrowed the custom tokenizer function), this one and many more, but none of the proposed solutions worked for me. I tried with both R 3.6.3 and R 4.1.2, with no success. Any ideas why?

Below is a minimal working example of my code:

library(RWeka) 
library(tm)

# List of keywords
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")

text <- c("Just a sample text that contains the words I am looking for.",
          "Words such as cheese are detected by tm, but others like spicy salami",
          "or sweet chili sauce are not.")
  
# Create a corpus  
text_corpus <- VCorpus(VectorSource(text)) # Switched from Corpus to VCorpus as suggested in some of the solutions on stackoverflow
  
## Custom tokenizer function
myTokenizer <- function(x) {NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))}

matrix <- as.matrix(TermDocumentMatrix(text_corpus,
                                       list(control = list (tokenize = myTokenizer),
                                            dictionary = my_keywords,
                                            list(wordLengths=c(1, Inf))
                                       )
))
  
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words), freq=words)

1 Answer


Here is a solution using only tm and NLP. There is no need for RWeka, which depends on rJava. Note that you had a mistake in the control portion of TermDocumentMatrix: you wrapped control inside an outer list, but all the options should go inside a single list passed to the control argument. Also, wordLengths doesn't need its own list; it belongs in the control list like the other options.
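To make the fix concrete, here is a minimal sketch of the two call shapes, using the objects from your question; only the structure of the control options differs:

# Wrong (from the question): the whole outer list becomes the control
# argument, and the nested `control` element is not a recognized option,
# so the custom tokenizer is silently ignored
TermDocumentMatrix(text_corpus,
                   list(control = list(tokenize = myTokenizer),
                        dictionary = my_keywords))

# Right: one flat list passed to the `control` argument
TermDocumentMatrix(text_corpus,
                   control = list(tokenize = myTokenizer,
                                  dictionary = my_keywords,
                                  wordLengths = c(1, Inf)))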

The tokenizer I created produces tokens of length 1, 2 and 3; otherwise "cheese" would not be picked up. Adjust the lengths as needed.

library(tm)
library(NLP) # loaded automatically with tm; provides ngrams() and words()

# List of keywords
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")

text <- c("Just a sample text that contains the words I am looking for.",
          "Words such as cheese are detected by tm, but others like spicy salami",
          "or sweet chili sauce are not.")

# Create a corpus  
text_corpus <- VCorpus(VectorSource(text)) 

## Custom tokenizer function: returns unigrams, bigrams and trigrams
myTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = " "), use.names = FALSE)
}
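
# A hand-worked illustration of what the n-gram step yields (order assumed:
# all 1-grams, then 2-grams, then 3-grams, as NLP::ngrams produces them):
# unlist(lapply(ngrams(c("sweet", "chili", "sauce"), 1:3), paste, collapse = " "))
# [1] "sweet"             "chili"             "sauce"
# [4] "sweet chili"       "chili sauce"       "sweet chili sauce"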

mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(tokenize = myTokenizer,
                                                   dictionary = my_keywords,
                                                   wordLengths = c(1, Inf))
                                    ))

mat
                   Docs
Terms               1 2 3
  cheese            0 1 0
  spicy salami      0 1 0
  sweet chili sauce 0 0 1
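
From here you can rebuild the frequency table from your question (a sketch; the counts follow from the matrix above, and note the object is now called mat, not matrix):

words <- sort(rowSums(mat), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
df
#                                word freq
# cheese                       cheese    1
# spicy salami           spicy salami    1
# sweet chili sauce sweet chili sauce    1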