
I want to calculate how often specific words appear in different documents (similar to sentiment analysis).

For this purpose, I've created a specific wordlist, and I have stored all of my documents in a corpus and then in a DTM (document-term matrix).
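For context, my setup looks roughly like this (everything except DTM.tfidf is a placeholder name):

library(tm)

#Build a corpus from a character vector of documents,
#then a tf-idf weighted document-term matrix from it
corpus    <- VCorpus(VectorSource(my_documents))
DTM.tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf))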

Now, I currently use the tm package in R with the following line:

Term.Frequency <- data.frame(tm_term_score(DTM.tfidf, Wordlist)) 

However, I am not really familiar with the technique behind this function, which is why I want to calculate the term frequency myself. Additionally, the code returns a score rather than the total frequency count.

I don't know how to calculate the frequency of just the words in my wordlist, though. Can anybody help me?

Ilya

1 Answer


I believe a similar question has been answered here using the tm package, the crude example dataset, and counting "crude" and "oil".
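To sketch that tm-based approach (assumptions: tm's built-in crude dataset and the default term-frequency weighting, under which the "score" is simply a raw count):

library(tm)
data("crude")

#With an unweighted DTM, tm_term_score sums raw counts
dtm <- DocumentTermMatrix(crude)
tm_term_score(dtm, c("crude", "oil"))       #counts per document
sum(tm_term_score(dtm, c("crude", "oil")))  #total across the corpus

This also points at your score-vs-frequency issue: if you apply tm_term_score to an unweighted DTM instead of DTM.tfidf, the scores are plain frequency counts.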

You can do something similar, though less elegant/complete, using base R functions as well, if for some reason that interests you.

#Some words to look for
search.words = c("boring", "example")
#Our text, which we remove punctuation from and split at the spaces
txt = "The example's text was boring. It was boring, the example's text."
set = strsplit(gsub("[[:punct:]]+", "", txt), " ")[[1]]

#A list containing the positions of each search word in our text
set.words        = lapply(search.words, grep, set, ignore.case = TRUE)
names(set.words) = search.words
#The number of times our words appeared
set.count        = lapply(set.words, length)

> set.count
$boring
[1] 2

$example
[1] 2
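
Note that grep does substring matching: "example" above also matches the token "examples" (which is what "example's" becomes once the punctuation is stripped). If you want exact whole-word counts instead, one variation:

#Exact, case-insensitive whole-word counts; with this text,
#"example" drops to 0 because the remaining tokens are "examples"
set.exact = sapply(search.words,
                   function(w) sum(tolower(set) == tolower(w)))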