I'm looking for specific n-grams in a corpus. Let's say I want to find 'asset management' and 'historical yield' in a collection of documents.
This is how I loaded the corpus:
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
                     readerControl = list(reader = readPDF))
I cleaned the corpus and did some basic calculations using document term matrices. Now I want to look for particular expressions and put them in a dataframe. This is what I use (thanks to phiver):
ngrams <- c('asset management', 'historical yield')
dtm_ngrams <- DocumentTermMatrix(my_corpus, control = list(dictionary = ngrams))
df_ngrams <- data.frame(Docs = dtm_ngrams$dimnames$Docs, as.matrix(dtm_ngrams), row.names = NULL)
This code runs, but the count is 0 for both n-grams in every document. My guess is that the dictionary is not defined correctly, because R doesn't pick up the space between the words. So far I have tried putting '' between the words, using [:space:], and a few other variations.
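One thing that may explain the zeros: as far as I understand, DocumentTermMatrix splits each document into single words by default, so a two-word dictionary entry can never match a token. Below is a minimal sketch of the alternative, assuming a custom bigram tokenizer built from ngrams() and words() (both come from the NLP package, which tm attaches). The toy documents and the names toy_corpus, bigram_tokenizer, ngram_dict, dtm_toy and df_toy are made up for illustration and are not from my real data.

```r
library(tm)

# Toy stand-in for the PDF corpus (hypothetical documents, not the real data)
docs <- c("The asset management team reviewed the historical yield.",
          "No matching expressions appear in this text.")
toy_corpus <- VCorpus(VectorSource(docs))

# Basic cleaning so that trailing punctuation does not stick to the tokens
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removePunctuation)

# Tokenizer that emits two-word sequences instead of single words
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

ngram_dict <- c("asset management", "historical yield")

# Pass the tokenizer together with the dictionary of bigrams
dtm_toy <- DocumentTermMatrix(toy_corpus,
                              control = list(tokenize = bigram_tokenizer,
                                             dictionary = ngram_dict))

# data.frame() turns the space in each term into a dot in the column name
df_toy <- data.frame(Docs = Docs(dtm_toy), as.matrix(dtm_toy),
                     row.names = NULL)
print(df_toy)
```

If this is the right direction, the same control list should carry over to my_corpus, since it is also a VCorpus. Three-word expressions would presumably need a tokenizer with 3 in place of 2.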