I'm trying to mine a set of PDFs for specific two- and three-word phrases. I know this question has been asked under various circumstances, and this solution partly works. However, the list does not return strings containing more than one word.
I've tried the solutions offered in other threads (here and here, for example, as well as many others). Unfortunately, nothing works.
Also, the qdap library won't load, and I wasted an hour trying to solve that problem, so that solution won't work either, even though it seems reasonably easy.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")
dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))
# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)
As you can see, the output has a column named "contract.prices" instead of "contract prices", and the multi-word phrases are never counted, so I'm looking for a simple solution to this. File 127 includes the phrase 'contract prices', so the table should record at least one instance of it.
I'm also happy to share my actual data, but I'm not sure how to save a small portion of it (it's gigantic). So for now I'm using a substitute with the 'crude' data.
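For reference, here is a minimal sketch of one direction that might address both symptoms. It is untested against my real PDFs. It assumes the phrases never match because DocumentTermMatrix splits text into single words by default, and that the "contract.prices" column name comes from data.frame() rewriting names. The helper name unigram_bigram_tokenizer is hypothetical (not part of tm). Because it splits on whitespace only, a phrase followed directly by punctuation may still be missed.

```r
library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices",
              "diamond", "shamrock", "diamond shamrock")

# Hypothetical helper (not part of tm): returns every single word plus
# every pair of adjacent words, so two-word dictionary entries can match.
unigram_bigram_tokenizer <- function(x) {
  w <- NLP::words(x)  # whitespace-separated tokens of the document
  bigrams <- vapply(NLP::ngrams(w, 2L), paste, character(1), collapse = " ")
  c(w, bigrams)
}

dtm <- DocumentTermMatrix(
  crude,
  control = list(tokenize = unigram_bigram_tokenizer, dictionary = my_words)
)

# check.names = FALSE stops data.frame() from turning "contract prices"
# into "contract.prices" in the column names.
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm),
                  row.names = NULL, check.names = FALSE)
head(df1)
```

For three-word phrases, the same idea would presumably extend by also appending NLP::ngrams(w, 3L) in the helper.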