
I am using R's tm library to look at term frequencies in a corpus. Ultimately I want to replicate the tf-idf term-weighting scheme found on pg 42 of this paper. Here is my code so far:

setwd("C:/Users/George/Google Drive/Agility")

library(tm)

cname <- ("C:/Users/George/Google Drive/R Templates/corpus")   

corpus <- Corpus(DirSource(cname))

#Cleaning
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c("a","the","an","that","and"))

#Create a term document matrix
tdm1 <- TermDocumentMatrix(corpus)

m1 <- as.matrix(tdm1)

#run frequencies
word.freq <- sort(rowSums(m1), decreasing=T)

#convert matrix to dataframe
frequencies <- as.data.frame(as.table(word.freq))

print(frequencies)

This works well enough, giving me a list of terms sorted by # of times they appear in the entire corpus:

                Var1 Freq
1                him 1648
2               unto 1486
3               they 1168
4               them  955
5                not  940

But what if instead of getting an aggregate count of frequencies, I want to count the number of documents in the corpus containing that term -- regardless of the number of times it was used in that document?

That first entry, for example, might show not that the word 'him' was used 1648 times, but that it appeared in 25 of the corpus's documents.

Thank you

  • This is not a [minimally reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). To me, this seems less a question of text mining than a question of data processing and analysis -- your question really concerns the processing of matrix `m1`. Providing a subset of `m1` here, for instance with something like `dput(head(m1))`, would be helpful. – Alexey Shiklomanov Apr 19 '17 at 19:00
  • In theory, `length(tdm1["him",]$v)` should yield the 25. To save mem, `tdm1` only saves the non-zero matrix entries. Those are in `v`. By counting the length, you get the number of docs. – lukeA Apr 19 '17 at 19:13
  • That does work. Thank you. –  Apr 19 '17 at 19:43
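
For reference, lukeA's suggestion works because tm stores the term-document matrix sparsely: only the non-zero counts are kept, in the `v` component, so the length of `v` for a single term's row is the number of documents that contain that term. A minimal sketch (the `him_row` name is just illustrative), assuming the `tdm1` built in the question:

him_row <- tdm1["him", ]   # one-row TermDocumentMatrix for the term "him"
length(him_row$v)          # number of documents containing "him" (the 25)
sum(him_row$v)             # total occurrences of "him" across the corpus (the 1648)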

1 Answer


A simple solution that gives you the document counts for all terms in one shot is to change:

word.freq <- sort(rowSums(m1), decreasing=T)

To:

word.freq2 <- sort(rowSums(m1 > 0), decreasing=T)

While word.freq holds the total number of times each term is used across the corpus, word.freq2 holds the number of documents that contain each term.
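
The reason this works: m1 > 0 turns the term-document matrix into a logical matrix that is TRUE wherever a term occurs in a document at least once, and rowSums then counts those TRUEs per term. A small self-contained illustration with a made-up three-document corpus, rather than the files from the question:

library(tm)

docs <- c("him and him again", "not him", "unto them")
toy  <- Corpus(VectorSource(docs))
m    <- as.matrix(TermDocumentMatrix(toy))

rowSums(m)       # total occurrences per term    ("him" -> 3)
rowSums(m > 0)   # documents containing each term ("him" -> 2)

If the eventual goal is the tf-idf weighting mentioned in the question, tm can also build a weighted matrix directly, e.g. TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf)), although tm's default scheme may differ from the one on pg 42 of the paper.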

Rafs