
I am trying to get the frequency of words used in each post and add them as columns to the training data. The code below runs fine for one word, but for the second word it throws the error shown in the comments below.

Function to fetch the frequency of a particular word across the 1,000 posts:

word_frequency <- function(w){

  for(i in 2:1000){
    # Build a one-document corpus from the i-th post
    review_text <- paste(Train$Post[i], collapse=" ")
    review_source <- VectorSource(review_text)

    corpus <- Corpus(review_source)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Count how often word w occurs in this post
    dtm <- DocumentTermMatrix(corpus)
    dtm2 <- as.matrix(dtm)
    frequency <- colSums(dtm2)
    frequency <- frequency[names(frequency) == w]
    frequency <- as.list(frequency)

    # Append this post's count to freq, which is initialised before each call (see the loop below)
    freq <- rbind(freq, frequency)
    freq.withNA <- sapply(freq, function(x) ifelse(x == "NULL", NA, x))
  } 
  return(freq)
}

Train <- Training[1:1000,]

Looping over all the words in my wordlist and cbind-ing the frequency column to the base data frame:

for (w in wordlist) {
  freq <- as.integer()        # reset the accumulator that word_frequency() grows
  new <- word_frequency(w)    # per-post counts for word w
  Train <- cbind(Train, new)  # add them as a new column
  print(paste("Completed word ", w, sep=""))
}
  • Output: [1] "Completed word will" Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 1000, 228 In addition: There were 50 or more warnings (use warnings() to see the first 50) – Srivijay Chaparala May 23 '16 at 09:22
    There is a lot of stuff missing here. It would be useful to produce a [minimal, reproducible example](http://stackoverflow.com/a/5963610/5805670), i.e. provide a small dataset with which you can replicate the error. Also, please provide the packages that you used for the functions (e.g. `VectorSource` seems to be from the `tm` package?). – slamballais May 23 '16 at 09:44
  • Yes, VectorSource is from the tm package. In addition, I loaded a lot of other packages: library(rJava), library(tm), library(lsa), library(googleVis), library(NLP), library(openNLP), library(RWeka), library(magrittr), library(openNLPmodels.en). – Srivijay Chaparala May 23 '16 at 09:52
  • Hmm, it seems that something went wrong when you tried pasting the dataset. Could you delete that comment and retry? – slamballais May 23 '16 at 09:58
  • Why not load all of the 1,000 documents into a single Corpus? That's what "corpus" means: a collection of texts. Will be far simpler and more efficient. – Ken Benoit May 23 '16 at 11:00
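
A minimal sketch of the single-corpus approach the last comment suggests, assuming Train$Post holds the raw post text and wordlist is the character vector of target words from the question (the counts variable name is just illustrative):

library(tm)

# Build one corpus containing all 1,000 posts at once
corpus <- Corpus(VectorSource(Train$Post))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One document-term matrix: rows = posts, columns = terms
dtm <- as.matrix(DocumentTermMatrix(corpus))

# For each target word, pull its column of per-post counts
# (zeros if the word never appears), then bind the counts to Train
counts <- sapply(wordlist, function(w) {
  if (w %in% colnames(dtm)) dtm[, w] else rep(0, nrow(dtm))
})
Train <- cbind(Train, counts)

This keeps one row of counts per post, so the cbind never sees mismatched row counts.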

0 Answers