I am trying to find a list of keywords in a text. Some of these keywords are n-grams, but TermDocumentMatrix only finds single words. I have already looked at several similar questions, like this one (from which I borrowed the custom tokenizer function), this one, and many more, but none of the proposed solutions worked for me. I tried both R 3.6.3 and R 4.1.2, with no success. Any ideas why?
Below is a minimal reproducible example of my code:
library(RWeka)
library(tm)
# List of keywords
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")
text <- c("Just a sample text that contains the words I am looking for.",
          "Words such as cheese are detected by tm, but others like spicy salami",
          "or sweet chili sauce are not.")
# Create a corpus
# Switched from Corpus to VCorpus, as suggested in some of the solutions on Stack Overflow
text_corpus <- VCorpus(VectorSource(text))
## Custom tokenizer function
myTokenizer <- function(x) {NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))}
matrix <- as.matrix(TermDocumentMatrix(text_corpus,
                                       list(control = list(tokenize = myTokenizer),
                                            dictionary = my_keywords,
                                            list(wordLengths = c(1, Inf)))))
words <- sort(rowSums(matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
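For what it's worth, my reading of `?TermDocumentMatrix` is that all options (`tokenize`, `dictionary`, `wordLengths`) should be sibling entries of one named `control` list, rather than nested lists as above. Here is a sketch of the call written that way; I have also set `min = 1` in the tokenizer so that unigrams like "cheese" are generated alongside the 2- and 3-grams:

```r
library(RWeka)
library(tm)

# Same data as above
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")
text <- c("Just a sample text that contains the words I am looking for.",
          "Words such as cheese are detected by tm, but others like spicy salami",
          "or sweet chili sauce are not.")
text_corpus <- VCorpus(VectorSource(text))

# min = 1 so single words are produced in addition to 2- and 3-grams
myTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

# One flat control list with tokenize, dictionary, and wordLengths as siblings
tdm <- TermDocumentMatrix(text_corpus,
                          control = list(tokenize = myTokenizer,
                                         dictionary = my_keywords,
                                         wordLengths = c(1, Inf)))
inspect(tdm)
```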