I am using R's tm package to look at term frequencies in a corpus. Ultimately I want to replicate the tf-idf term-weighting scheme found on pg 42 of this paper. Here is my code so far:
setwd("C:/Users/George/Google Drive/Agility")
library(tm)
cname <- "C:/Users/George/Google Drive/R Templates/corpus"
corpus <- Corpus(DirSource(cname))
#Cleaning
#wrap base R functions like tolower in content_transformer() so tm 0.6+
#keeps the corpus structure intact for TermDocumentMatrix()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a", "the", "an", "that", "and"))
#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)
m1 <- as.matrix(tdm1)
#run frequencies
word.freq <- sort(rowSums(m1), decreasing = TRUE)
#convert matrix to dataframe
frequencies <- as.data.frame(as.table(word.freq))
print(frequencies)
This works well enough, giving me a list of terms sorted by the number of times each appears across the entire corpus:
Var1 Freq
1 him 1648
2 unto 1486
3 they 1168
4 them 955
5 not 940
But what if, instead of an aggregate count of frequencies, I want to count the number of documents in the corpus that contain each term, regardless of how many times the term is used within a document?
That first entry, for example, would then show not that the word 'him' was used 1648 times, but that it appeared in 25 of the corpus's documents.
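My guess is something like the following, counting the nonzero entries in each row of the matrix instead of summing them, but I'm not sure it's the idiomatic tm approach:

#count, for each term, the number of documents it appears in at least once
doc.freq <- sort(rowSums(m1 > 0), decreasing = TRUE)
doc.frequencies <- as.data.frame(as.table(doc.freq))
print(doc.frequencies)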
Thank you
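For reference, I'm aware tm ships a generic tf-idf weighting that can be set when building the matrix, sketched below, but the scheme on pg 42 of the paper may normalize differently, and its idf component depends on the per-document counts I'm asking about:

#tm's built-in tf-idf weighting, shown only for comparison;
#not necessarily the same formula as the paper's scheme
tdm2 <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))
m2 <- as.matrix(tdm2)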