
This is my first post so hopefully I'm following the rules. I have a corpus of documents and a list of unique 2-grams in this corpus. There are some 80,000+ unique 2-grams and some 4,000+ documents in the corpus. I'm looking to find the number of times each unique 2-gram is used in the corpus, for the purpose of removing 2-grams that are used either too frequently or too infrequently. I wrote some code that I think is quite pretty, but unfortunately it seems quite slow.

    count2 = numeric(length(unique2))   # pre-allocated vector of counts
    for(i in 1:length(unique2)) {
      # count how many documents' 2-gram lists contain unique2[i]
      count2[i] = length(which(sapply(grams2, f, x = unique2[i])))
    }

unique2 is my vector of unique 2-grams, and grams2 is a list of length length(corpus), where each sublist contains all the 2-grams in a particular document of the corpus. f is the function f = function(x, y) { any(x %in% y) }.
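To make this concrete, here is a toy illustration (not my real data) of the structures involved:

grams2 = list(c("the cat", "cat sat"),    # 2-grams from document 1
              c("the dog", "the cat"))    # 2-grams from document 2
unique2 = c("the cat", "cat sat", "the dog")
f = function(x, y) { any(x %in% y) }      # TRUE if 2-gram x occurs in document y
# sapply(grams2, f, x = "the cat") gives TRUE TRUE, so count2 for "the cat" is 2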

Right now it takes a little over 2 hours to find the use counts for each unique 2-gram in the corpus. I'd like to make that faster. I see two ways of doing so: 1) think of a better way to find which 2-grams are used frequently/infrequently than counting, or 2) think of a better way to count the frequency of each 2-gram. Since 2) seems more likely, I'm hoping someone can spot a noobish, inefficient step I'm taking in my code.

In keeping with the rules, I found what might be answers to my question in Perl (How to count words in a corpus document) and Python (How to count words in a corpus document), but not in R, which is the language I'm doing this analysis in.

Hope to hear from you soon and thanks a ton.

Nick

EDIT: I also found some solutions in R (R Text Mining: Counting the number of times a specific word appears in a corpus?). One solution appears to rely on table() and another on apply() + %in%, which is what my code relies on. Does anyone have intuition as to whether the table() solution is faster than an apply() + %in% solution? My machine is currently running my code on the 3-gram problem, so I can't benchmark it without stopping early.
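For reference, the table() version I have in mind would look roughly like this (an untested sketch against my structures; it counts how many documents each 2-gram appears in, which is what my loop does):

doc.presence = lapply(grams2, unique)   # keep one copy of each 2-gram per document
tab = table(unlist(doc.presence))       # document frequency for every 2-gram at once
count2.alt = as.integer(tab[unique2])   # counts in the same order as unique2
count2.alt[is.na(count2.alt)] = 0       # 2-grams in unique2 but absent from grams2 get 0

Since this makes a single pass over grams2 rather than one pass per 2-gram, I'd expect it to scale better, but I haven't been able to time it yet.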

SECOND EDIT: Here is some code that creates a corpus similar to the one I'm working with.

letters = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n")  # note: masks R's built-in letters constant
corpus = vector(mode = "list", length = 300)
for(j in 1:length(corpus)) {
  len.sent = runif(1, 3, 30)            # number of "words" in this document
  sentence = rep(0, len.sent)
  for(i in 1:len.sent) {
    len.word = runif(1, 2, 7)           # length of each fake word
    samp = sample(letters, len.word, replace = TRUE)
    sentence[i] = paste(samp, collapse = '')
  }
  corpus[[j]] = sentence
}
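And here is a rough sketch (a hypothetical helper, not my actual extraction code) of how grams2 and unique2 could be built from that toy corpus by pasting adjacent words:

make.2grams = function(words) {
  # paste each word with the one following it; documents with < 2 words yield nothing
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1])
}
grams2 = lapply(corpus, make.2grams)
unique2 = unique(unlist(grams2))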
  • A small reproducible data set would be helpful and is usually expected to illustrate and reproduce errors/problems. Use tm's built-in data sets. – Tyler Rinker Jun 25 '14 at 20:26
  • The data is protected so I definitely can't include that. I'll code up a simple data set and include it in an edit. – Nick Thieme Jun 25 '14 at 20:30
  • Seems like a good reference/comparison for what you are trying to accomplish: http://pvanb.wordpress.com/2012/06/21/cross-tables-in-r-some-ways-to-do-it-faster/ In general, I feel that you should be working with corpus objects if you are doing text mining (package `tm`). – Vlo Jun 25 '14 at 21:04
  • Questions about code optimization are better suited for http://codereview.stackexchange.com. Very nice first question though, I wish we had more like that :) – merours Jun 26 '14 at 09:09

0 Answers