I'm fitting my first LDA model in R and ran into this error:
Error in LDA(Corpus_clean_dtm, k, method = "Gibbs", control = list(nstart = nstart, : Each row of the input matrix needs to contain at least one non-zero entry
Here's my code for the model, which includes some standard pre-processing steps:
library(tm)
library(topicmodels)
library(textstem)
df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2",
             "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"),
  text = c("He do subjects prepared bachelor juvenile ye oh.",
           "He feelings removing informed he as ignorant we prepared.",
           "He feelings removing informed he as ignorant we prepared.",
           "He feelings removing informed he as ignorant we prepared.",
           "Fond his say old meet cold find come whom. ",
           "Wonder matter now can estate esteem assure fat roused.",
           ".Am performed on existence as discourse is.",
           "Moment led family sooner cannot her window pulled any.",
           "Why resolution one motionless you him thoroughly.",
           "Why resolution one motionless you him thoroughly.")
)
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, lemmatize_strings)
  return(corpus)
}
# Keep only the first row for each doc_id (the column is named doc_id, not ID)
df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
Corpus <- Corpus(DataframeSource(df))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(203, 500, 623, 1001, 765)
nstart <- 5
best <- TRUE
k <- 5
LDAresult_1683 <- LDA(Corpus_clean_dtm, k, method = "Gibbs",
                      control = list(nstart = nstart, seed = seed, best = best,
                                     burnin = burnin, iter = iter, thin = thin))
After some searching, it looks like my DocumentTermMatrix may contain empty documents (as previously mentioned here and here), which would produce this error message.
I then removed the empty documents and re-ran the LDA model, and everything went smoothly. No errors were thrown.
rowTotals <- apply(Corpus_clean_dtm, 1, sum)
Corpus_clean_dtm.new <- Corpus_clean_dtm[rowTotals > 0, ]
Corpus_clean_dtm.empty <- Corpus_clean_dtm[rowTotals <= 0, ]
Corpus_clean_dtm.empty$dimnames$Docs
I then manually looked up the document IDs from Corpus_clean_dtm.empty (which holds all the empty document entries) and matched the same IDs (and row numbers) in 'Corpus_clean', only to realise that these documents aren't really 'empty': each 'empty' document contains at least 20 characters.
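The manual lookup can be sketched in code roughly as follows. This is a minimal sketch on hypothetical stand-in data (the `docs` data frame and its IDs are made up for illustration, not taken from my real data), and it assumes the document IDs stored in the matrix still correspond to the `doc_id` column. It matches documents by ID rather than by row position:

```r
library(tm)
library(slam)  # row_sums() works on the sparse matrix directly

# Hypothetical stand-in data, not the data from the question.
docs <- data.frame(
  doc_id = c("a/1", "b/1", "c/2"),
  text   = c("some words here", "as is", "more words follow"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# Total term count per document (one entry per row of the matrix).
row_totals <- slam::row_sums(dtm)

# IDs of the rows that ended up with no terms at all.
empty_ids <- Docs(dtm)[row_totals == 0]

# Look the documents up by ID in the source data frame,
# rather than by row position in the corpus.
empty_docs <- docs[match(empty_ids, docs$doc_id), ]
print(empty_docs)
```

If `match()` returns `NA` for any ID, the IDs in the matrix no longer line up with `doc_id`, and a lookup by row number could point at the wrong document.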
Am I missing something here?