
I have been attempting to find the frequency of each term in Martin Luther King's "I Have a Dream" speech. I have converted all uppercase letters to lowercase and removed all stop words. The text is in a .txt file, so I cannot display it here. The code that reads in the file is below:

 speech <- readLines("speech.txt")

Then I performed the conversion to lowercase and the removal of stop words successfully, and called the result:

 clean.speech 

Now I am having some issues finding the frequency of each term. I have created a corpus, inspected it, and created a TermDocumentMatrix as follows:

 myCorpus <- Corpus(VectorSource(clean.speech))
 inspect(myCorpus)
 TDM <- TermDocumentMatrix(myCorpus)

Everything is fine up to this point. However, I then ran the following code and got this warning:

 m < as.matrix(TDM)

 Warning message:
 In m < as.matrix(TDM) : longer object length is not a multiple of shorter object length

I know this is a very common warning message, so I Googled it first, but I could not find anything pertaining to term frequency. I proceeded to run the following code to see if it would work despite the warning, but it did not.

 v <- sort(rowSums(m), decreasing = TRUE)
 d <- data.frame(word=names(v), freq=v)
 head(d, 15)

My goal is just to find the frequency of the terms. I sincerely apologize for asking this question because I know it gets asked a lot; I just do not understand what to change about my code. Thank you, everyone, I appreciate it!

  • Try to make a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Now we can only guess what goes wrong. – phiver Oct 20 '15 at 07:02
  • I can't post the link directly because it is from my Blackboard account through my university that requires credentials. Once I open the link I copy all of the text into a .txt file. I figured this question was a long shot, but thank you for trying! I appreciate it! :) – mapleleaf Oct 20 '15 at 13:52

2 Answers


If your goal is just to find the frequency of the terms, then try this.

First, I get the "I Have a Dream" speech into a character vector:

# get the text of the speech from an HTML source, and extract the text
library(XML)
doc.html <- htmlTreeParse('http://www.analytictech.com/mb021/mlk.htm', useInternal = TRUE)
doc.text <- unlist(xpathApply(doc.html, '//p', xmlValue))
doc.text <- paste(doc.text, collapse = ' ')

Then, I create the document-feature matrix in quanteda, removing stop words (and adding "will", since quanteda's built-in list of English stop words does not include that term). From there, topfeatures() gives you the most frequent terms and their counts.

library(quanteda)
# create a document-feature matrix
IHADdfm <- dfm(doc.text, ignoredFeatures = c("will", stopwords("english")), verbose = FALSE)
# 12 most frequent features
topfeatures(IHADdfm, 12)
## freedom      one     ring    dream      let      day    negro    today     able    every together    years 
##      13       12       12       11       10        9        8        7        7        7        6        5 
# a word cloud, if you wish
plot(IHADdfm, random.order = FALSE)

[word cloud of the most frequent terms]

Ken Benoit

Just call findFreqTerms(), e.g. tm::findFreqTerms(TDM, lowfreq = 2, highfreq = 5).

(The tm:: prefix is optional; it just makes explicit that findFreqTerms() is a function from the tm package.)
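As an aside, the warning in the question comes from m < as.matrix(TDM): the single < performs an element-wise comparison rather than assignment, which should be m <- as.matrix(TDM). Once that is fixed, the rowSums() approach in the question works. If all you want is raw term counts, base R can also do it without any packages; here is a minimal sketch, using a toy in-line string as a stand-in for the real clean.speech from the question:

```r
## Toy stand-in for the cleaned, lowercased speech text from the question
clean.speech <- "free at last free at last thank god almighty we are free at last"

## Split on whitespace and tabulate term frequencies in base R
words <- unlist(strsplit(clean.speech, "\\s+"))
freq  <- sort(table(words), decreasing = TRUE)
head(freq, 15)
```

This gives the same kind of sorted word/frequency view as the data.frame built from rowSums(m) in the question, just without going through a corpus.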

knb