My task is to apply LDA to a dataset of Amazon reviews and extract 50 topics.

I have extracted the review text into a vector and am now trying to run LDA on it.

I created the document-term matrix (DTM) like this:

matrix <- create_matrix(dat, language="english", removeStopwords=TRUE,  stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE)

<<DocumentTermMatrix (documents: 100000, terms: 174632)>>
Non-/sparse entries: 4096244/17459103756
Sparsity           : 100%
Maximal term length: 218
Weighting          : term frequency (tf)

But when I try to fit the model, I get the following error:

lda <- LDA(matrix, 30)

Error in LDA(matrix, 30) : 
  Each row of the input matrix needs to contain at least one non-zero entry

I searched for solutions and tried using rollup from the slam package:

    matrix1 <- rollup(matrix, 2, na.rm=TRUE, FUN = sum)

but I still get the same error.

I am very new to this. Can someone help me, or suggest a reference where I can study this? It would be very helpful.

There are no empty rows in my original data, which has only one column containing the reviews.

Abhishek Gangwar
  • Possible duplicate of [Remove empty documents from DocumentTermMatrix in R topicmodels?](http://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels) – scoa Feb 08 '16 at 08:14
  • Essentially, the error message is telling you that some of the documents are empty. You should remove those. – scoa Feb 08 '16 at 08:15
  • In my original matrix there are no empty rows. It is only after I build the DTM and run LDA that I get the error. – Abhishek Gangwar Feb 08 '16 at 10:50
  • Do you mean in the document-term matrix or in the corpus? You may not have any empty rows in your corpus, but you can create some when you remove stopwords. – scoa Feb 08 '16 at 10:53
  • I used matrix <- create_matrix(dat, language="english", removeStopwords=FALSE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE) but now I am also getting NA entries in the DTM – Abhishek Gangwar Feb 08 '16 at 15:36
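As the comments point out, a review can lose all of its terms during preprocessing even when the original text is not empty. Below is a minimal sketch of dropping such rows before calling LDA, assuming the tm and slam packages are installed. The toy `reviews` vector and the names `row_totals` and `dtm_nonempty` are made up for illustration and are not from the question.

```r
library(tm)
library(slam)

# Toy data (hypothetical): the second review contains only stopwords and
# very short words, so it has no terms left after preprocessing.
reviews <- c("Great blender, works well",
             "it is the",
             "Battery died after a week")

corpus <- VCorpus(VectorSource(reviews))
dtm <- DocumentTermMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

# Count the terms remaining in each document. slam::row_sums works on the
# sparse representation, so the matrix is never converted to a dense one.
row_totals <- slam::row_sums(dtm)

# Keep only the documents that still contain at least one term
dtm_nonempty <- dtm[row_totals > 0, ]
```

The filtered matrix can then be passed to LDA. Avoiding `as.matrix` matters for a DTM of the size shown in the question (100000 x 174632), where a dense copy is unlikely to fit in memory.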

1 Answer

I was assigned a similar task and am still learning as I go. I have made some progress, so I am sharing my code snippet; I hope it helps.

library("topicmodels")
library("tm")

# Example documents
x <- c("I like to eat broccoli and bananas.",
       "I ate a banana and spinach smoothie for breakfast.",
       "Chinchillas and kittens are cute.",
       "My sister adopted a kitten yesterday.",
       "Look at this cute hamster munching on a piece of broccoli.")



#whole file is lowercased
#text<-tolower(x)

#deleting all common words from the text
#text2<-setdiff(text,stopwords("english"))

#splitting the text into vectors where each vector is a word..
#text3<-strsplit(text2," ")

# Generating a structured text i.e. Corpus
docs<-Corpus(VectorSource(x))

# Create a content transformer, i.e. a function used to modify the documents in the corpus
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Replace special characters with spaces and remove numbers

docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, removeNumbers)

# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove punctuation
docs <- tm_map(docs, removePunctuation)

# Collapse runs of whitespace into a single space
docs <- tm_map(docs, stripWhitespace)

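# LDA needs documents as rows, so build a document-term matrix
dtm <- DocumentTermMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))

# Total frequency of each term across all documents
freq <- colSums(as.matrix(dtm))
print(names(freq))

# Save the terms sorted by decreasing frequency
ord <- order(freq, decreasing = TRUE)
write.csv(freq[ord], "word_freq.csv")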

# Set the parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003, 5, 63, 100001, 765)  # one seed per random start
nstart <- 5
best <- TRUE

# Number of topics
k <- 3

# Docs to topics    
    ldaOut<-LDA(dtm,k,method="Gibbs",control=list(nstart=nstart,seed=seed,best=best,burnin=burnin,iter=iter,thin=thin))

    ldaOut.topics<-as.matrix(topics(ldaOut))
    write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
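Once the model is fitted, the topics can also be inspected directly instead of only writing them to a file. The following is a small self-contained sketch, assuming tm and topicmodels are installed. The values `k = 2`, `burnin = 100` and `iter = 200` are illustrative choices for a quick run rather than recommended settings, and the names `fit`, `top_terms`, `doc_topics` and `doc_probs` are made up for the example.

```r
library(tm)
library(topicmodels)

docs <- c("I like to eat broccoli and bananas.",
          "I ate a banana and spinach smoothie for breakfast.",
          "Chinchillas and kittens are cute.",
          "My sister adopted a kitten yesterday.",
          "Look at this cute hamster munching on a piece of broccoli.")

# Documents as rows, terms as columns
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)),
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

# Short Gibbs run with a fixed seed (illustrative values, not tuned)
fit <- LDA(dtm, k = 2, method = "Gibbs",
           control = list(seed = 2003, burnin = 100, iter = 200))

top_terms  <- terms(fit, 3)           # 3 most probable terms per topic
doc_topics <- topics(fit)             # most likely topic for each document
doc_probs  <- posterior(fit)$topics   # documents x topics probability matrix
```

Printing `top_terms` gives a quick check on whether the topics look sensible before scaling up to more topics or a larger corpus.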
Partha Roy