I am familiar with using the tm library to create a term-document matrix (TDM) and count term frequencies.
But these terms are all single words.
How can I count the number of times a multi-word phrase occurs in a document and/or corpus?
EDIT:
I am adding the code I have now to improve/clarify my post.
This is pretty standard code to build a term-document matrix:
library(tm)
# Directory containing the corpus text files
cname <- "C:/Users/George/Google Drive/R Templates/Gospels corpus"
corpus <- Corpus(DirSource(cname))
# Cleaning
# tolower is a base R function, so it is wrapped in content_transformer() (tm >= 0.6);
# this also makes the later PlainTextDocument conversion unnecessary
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a", "the", "an", "that", "and"))
# Create a term-document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
# Total count of each term across all documents, most frequent first
word.freq <- sort(rowSums(m1), decreasing = TRUE)
# Keep the 100 most frequent terms
word.freq <- head(word.freq, 100)
The problem is that this returns a vector of single-word term frequencies, for example:
all into have from were one came say out
397 390 385 383 350 348 345 332 321
I want to be able to count multi-word terms in the corpus instead, for example "came from" as a single term rather than "came" and "from" separately.
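To make the desired output concrete, here is a rough base-R sketch on a toy sentence. It does not use tm at all, and count_bigrams and the toy string are names and data made up purely for illustration; what I am after is the equivalent result for the tm corpus above.

```r
# Toy illustration only: count two-word phrases (bigrams) in one character string.
# count_bigrams is a hypothetical helper, not part of tm.
count_bigrams <- function(text) {
  # Lower-case, replace anything that is not a letter or space, split on spaces
  clean <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(trimws(clean), " +")[[1]]
  if (length(words) < 2) return(integer(0))
  # Pair each word with the word that follows it
  bigrams <- paste(head(words, -1), tail(words, -1))
  # Frequency table of the pairs, most frequent first
  sort(table(bigrams), decreasing = TRUE)
}

toy <- "He came from Nazareth and they came from Galilee"
phrase.freq <- count_bigrams(toy)
# "came from" occurs twice in the toy sentence
phrase.freq["came from"]
```

This only handles a single string and only two-word phrases, so it is a description of the goal rather than a solution for a whole corpus.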
Thank you.