
I turned about 50,000 rows of varchar data into a corpus, and then cleaned that corpus using the tm package, getting rid of stopwords, punctuation, and numbers.

I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most frequent terms and the number of times each shows up in the data.

However, I want a function that takes a given word and returns how many times that word appears in the TermDocumentMatrix.

Is there a function in tm that achieves this? Or do I have to convert my data to a data.frame and use a different package and function?

George

1 Answer


Since you have not given a reproducible example, I will give one using the crude dataset available in the tm package.

You can do this in (at least) two different ways, but anything that turns a sparse matrix into a dense matrix can use a lot of memory. So I will give you two options: the first is more memory friendly because it works on the sparse tdm directly; the second first transforms the tdm into a dense matrix and then builds a frequency vector.

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))


tdm <- TermDocumentMatrix(crude)

# Making use of the fact that a tdm or dtm is a simple_triplet_matrix from slam
my_func <- function(data, word){
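  # logical match on the Terms dimnames keeps the matrix sparse;
  # slam::row_sums then totals that term's counts across all documents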
  slam::row_sums(data[data$dimnames$Terms == word, ])
}

my_func(tdm, "crude")
crude 
   21 
my_func(tdm, "oil")
oil 
 85
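
Note that if a term is absent from the tdm, the sparse function simply returns an empty named vector ("notaterm" is a made-up example):

my_func(tdm, "notaterm")
named numeric(0)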

# turn tdm into dense matrix and create frequency vector. 
freq <- rowSums(as.matrix(tdm))
freq["crude"]
crude 
   21 
freq["oil"]
oil 
 85 
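If the dense conversion is too large for memory, the same named frequency vector can be built directly on the sparse tdm (a sketch using slam, which tm already uses under the hood):

# memory-friendly equivalent of rowSums(as.matrix(tdm))
freq <- slam::row_sums(tdm)
freq["crude"]
crude 
   21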

Edit: as requested in the comments:

# all words starting with cru. Adjust regex to find what you need.
freq[grep("^cru", names(freq))]
crucial   crude 
      2      21 

# separate words
freq[c("crude", "oil")]
crude   oil 
   21    85 
phiver
  • apologies for not providing sample data, but converting the tdm into a dense matrix and creating a frequency vector worked. Thank you. Is it possible to search for multiple word variants at once with wildcards, e.g. freq["crude%"]? Is it possible to get the counts of two separate words, e.g. freq["crude", "oil"]? – George May 08 '18 at 14:58
  • @George, I edited my answer to include these 2 examples – phiver May 08 '18 at 15:41
  • fantastic! Last question: in order to search for bigrams and trigrams (2 or 3 consecutive words), do I have to use the RWeka package? Or, with the data in its current form, can I use the freq vector to search freq["not fit"]? – George May 08 '18 at 17:55
  • No, you do not need to use RWeka if you do not want to (which saves on rJava dependencies). [Here](https://stackoverflow.com/questions/50114633/dict-function-for-ngrams/50114876#50114876) you can find an answer of mine that uses just NLP to create bigrams; change the 2 into a 3 and you get trigrams (see the sketch after these comments). The advantage of RWeka is that you can create a tokenizer that does bigrams and trigrams in one go. freq["not fit"] currently does not exist since the default is to separate each word. – phiver May 08 '18 at 18:12
  • Hey @phiver, how can you search for multiple words with wildcards using the grep function? This syntax isn't working. Basically your second example, but if crude and oil had wildcards: freq[grep("does", names(text)), grep("align", names(text))] – George May 08 '18 at 22:10
  • regex procedures :-). Use e.g. grep("^cru|^oil", .....). Check some regex tutorials/websites on how to use the magic power of regex :-) – phiver May 09 '18 at 07:57
  • For me, solution 2 (the dense matrix one) didn't work with my dataset due to memory limitations. – Rafs Nov 09 '20 at 12:29
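
Following up on the bigram discussion in the comments, here is a minimal sketch of the NLP-based approach, patterned on the tokenizer from the tm FAQ and applied to the cleaned crude corpus from above ("crude oil" is just an illustrative bigram; change the 2L to 3L for trigrams):

# bigram tokenizer built on NLP::ngrams (NLP is loaded with tm); no RWeka needed
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "), use.names = FALSE)

tdm2 <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
freq2 <- slam::row_sums(tdm2)  # sparse frequency vector of bigrams
freq2["crude oil"]             # count of the bigram across all documents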