This is my first post, so hopefully I'm following the rules. I have a corpus of documents and a list of the unique 2-grams in this corpus. There are 80,000+ unique 2-grams and 4,000+ documents in the corpus. I want to find the number of times each unique 2-gram is used in the corpus, in order to remove 2-grams that are used either too frequently or too infrequently. I wrote some code that I think is quite pretty, but unfortunately it seems quite slow.
# For each unique 2-gram, count the number of documents whose 2-gram
# list contains it (f returns one TRUE/FALSE per document)
for (i in 1:length(unique2)) {
  count2[i] <- length(which(sapply(grams2, f, x = unique2[i])))
}
unique2 is my vector of unique 2-grams, and grams2 is a list of length length(corpus), where each element holds all the 2-grams contained in a particular document of the corpus. f is the function:

f <- function(x, y) {
  any(x %in% y)
}
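In case the shape of grams2 is unclear, here is an illustrative sketch of how such a list could be built, assuming each document in corpus is a character vector of words (this is for illustration only, not my actual preprocessing):

# Pair each word with its successor to form the document's 2-grams
grams2 <- lapply(corpus, function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
})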
Right now it takes a little over 2 hours to find the use counts for each unique 2-gram in the corpus, and I'd like to make that faster. I see two ways of doing so: 1) find a better way to identify frequently/infrequently used words than counting them, or 2) find a better way to count the frequency of each word. Since 2) seems more likely, I'm hoping someone can spot a noobish, inefficient step I'm taking in my code.
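For reference, here is a table()-based sketch of what I think a vectorized version would look like (untested on my real data). Like my loop, it computes the number of documents containing each unique 2-gram:

# Keep each 2-gram at most once per document, then tabulate over the
# whole corpus to get per-gram document frequencies in a single pass
doc.freq <- table(unlist(lapply(grams2, unique)))
count2 <- as.integer(doc.freq[unique2])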
In keeping with the rules, I found what might be answers to my question in Perl (How to count words in a corpus document) and in Python (How to count words in a corpus document), but not in R, which is the language I'm doing this analysis in.
Hope to hear from you soon and thanks a ton.
Nick
EDIT: I also found some solutions in R (R Text Mining: Counting the number of times a specific word appears in a corpus?). One solution appears to rely on table() and another on apply() + %in%, which is what my code relies on. Does anyone have intuition as to whether the table() solution is faster than an apply() + %in% solution? My machine is currently running my code on the 3-gram problem, so I can't benchmark the alternatives without stopping it early.
SECOND EDIT: Here is some code that creates a corpus similar to the one I'm working with; a benchmarking sketch using it follows the snippet.
# Note: this masks R's built-in letters vector with a 14-letter subset
letters <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n")
corpus <- vector(mode = "list", length = 300)
for (j in 1:length(corpus)) {
  len.sent <- runif(1, 3, 30)    # words per document (truncated to an integer by rep/1:)
  sentence <- rep("", len.sent)
  for (i in 1:len.sent) {
    len.word <- runif(1, 2, 7)   # letters per word
    samp <- sample(letters, len.word, replace = TRUE)
    sentence[i] <- paste(samp, collapse = "")
  }
  corpus[[j]] <- sentence
}
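And here is a sketch of how the loop and the table() approach could be benchmarked against each other on that toy corpus (the 2-gram construction uses the consecutive-word-pair assumption from the earlier sketch):

# Build per-document 2-gram lists and the vector of unique 2-grams
grams2 <- lapply(corpus, function(words) paste(head(words, -1), tail(words, -1)))
unique2 <- unique(unlist(grams2))

# Approach 1: the original sapply() + %in% loop
f <- function(x, y) any(x %in% y)
count2 <- integer(length(unique2))
system.time(
  for (i in 1:length(unique2)) {
    count2[i] <- length(which(sapply(grams2, f, x = unique2[i])))
  }
)

# Approach 2: the table() alternative
system.time({
  doc.freq <- table(unlist(lapply(grams2, unique)))
  count2.tab <- as.integer(doc.freq[unique2])
})

identical(count2, count2.tab)  # should be TRUE: both count document frequency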