
I'm trying my first LDA model in R and ran into an error:

Error in LDA(Corpus_clean_dtm, k, method = "Gibbs", control = list(nstart = nstart,  :    Each row of the input matrix needs to contain at least one non-zero entry

Here's the code for the model, which includes some standard pre-processing steps:

library(tm)
library(topicmodels)
library(textstem)


df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2", 
    "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"), 
  text = c("He do subjects prepared bachelor juvenile ye oh.", 
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "Fond his say old meet cold find come whom. ",
    "Wonder matter now can estate esteem assure fat roused.",
    ".Am performed on existence as discourse is.", 
    "Moment led family sooner cannot her window pulled any.",
    "Why resolution one motionless you him thoroughly.", 
    "Why resolution one motionless you him thoroughly.")     
)


clean_corpus <- function(corpus){
                  corpus <- tm_map(corpus, stripWhitespace)
                  corpus <- tm_map(corpus, removePunctuation)
                  corpus <- tm_map(corpus, tolower)
                  corpus <- tm_map(corpus, lemmatize_strings)
                  return(corpus)
                }

df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
Corpus <- Corpus(DataframeSource(df))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)


burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(203, 500, 623, 1001, 765)
nstart <- 5
best <- TRUE
k <- 5

LDAresult_1683 <- LDA(Corpus_clean_dtm, k, method = "Gibbs", 
  control = list(nstart = nstart, seed = seed, best = best, 
  burnin = burnin, iter = iter, thin = thin))

After some searching, it looks like my DocumentTermMatrix may contain empty documents (previously mentioned here and here), which resulted in this error message.

I then removed the empty documents and re-ran the LDA model, and everything went smoothly. No errors were thrown.

rowTotals <- apply(Corpus_clean_dtm, 1, sum)
Corpus_clean_dtm.new <- Corpus_clean_dtm[rowTotals > 0, ]
Corpus_clean_dtm.empty <- Corpus_clean_dtm[rowTotals <= 0, ]
Corpus_clean_dtm.empty$dimnames$Docs

I then manually looked up the row numbers/IDs from Corpus_clean_dtm.empty (which holds all the empty document entries) and matched the same IDs (and row numbers) in Corpus_clean, and realised that these documents aren't really 'empty': each 'empty' document contains at least 20 characters.
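For anyone reproducing this, the zero-row check can be illustrated on a toy term-count matrix in plain base R (the document and term names here are made up for illustration):

```r
# Toy term-count matrix: rows = documents, columns = terms.
# "d2" stands in for a document whose words were all dropped when
# building the DocumentTermMatrix (e.g. words shorter than 3 characters),
# so its row is all zeros even though the raw text wasn't empty.
m <- matrix(c(2, 0, 1,
              0, 0, 0,
              1, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("repair", "fault", "porter")))

rowTotals <- apply(m, 1, sum)
m_nonempty <- m[rowTotals > 0, , drop = FALSE]  # documents LDA will accept
rownames(m_nonempty)  # "d1" "d3"
```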

Am I missing something here?

byc
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Oct 26 '18 at 15:56
  • Cheers, I've added a data frame with duplicated IDs and random texts - very similar to my original dataframe in terms of word / character lengths. I expected the code I put here would keep all the unique entries and no non-zero entry error when converting the dataframe to DocumentTermMatrix for LDA. Please let me know if you need anything else! thanks. – byc Oct 26 '18 at 16:34
  • Where did you define `k` and all the variables passed in the `control=` parameter? This example is still not reproducible if all variables aren't defined. – MrFlick Oct 26 '18 at 18:11
  • added all training parameters in this edit :) – byc Oct 26 '18 at 23:40
  • If I run your example after changing `select = ID` to `select = doc_id`, I don't have any errors. But you should be aware that turning a corpus into a document term matrix will by default remove words that are shorter than 3. So words like "he", "be", "so" will be removed. If you don't want that you need to adjust the `wordLengths` parameter when creating a document term matrix, like `DocumentTermMatrix(Corpus_clean, control = list(wordLengths = c(1, Inf)))` – phiver Oct 27 '18 at 11:37
  • @phiver yes I realised the wordlength restraint - when I performed the manual check, I noticed some failed "empty" documents contained words that were being categorised (i.e. appeared as columns) but they were not counted in the document matrix. – byc Oct 29 '18 at 16:11
  • After some investigation I found out `tm_map(corpus, lemmatize_strings)` had caused the issue. Strangely by including this line of code, about 97% of my documents was successfully lematised and failed the rest i.e. produced 'non-zero'entry documents. I've also tried taken out the lemmatisation from `clean_corpus()`, and apply at the `DocumentTermMatrix(CorpDescriptionclean, control = list(wordLengths = c(1, Inf), tokenize=LemmaTokenizer))`; where `LemmaTokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")`. However they wouldn't 100% lemmatise all strings, any ideas? Thanks! – byc Oct 29 '18 at 16:14
  • @cyb, no idea without seeing a few sentences that fail the lemmatisation. It might be something simple or a rule failure. If you could share one or two sentences I can use some other lemmatisation functions to see if they also fail. – phiver Oct 29 '18 at 17:16
  • `df = data.frame(doc_id = c("A", "B"), text = c("Cartel, Ref.74339. repair fault to the monitor to the block in the porters lodge.", "Cartel, Ref,77417. Repair fault to the porter lodge computer not working. Links to all door and cameras.")) ` - Here are my failed example documents @phiver; words like cartel, repair, porters, fault were seen in many other documents and were lemmatised & counted in the dtm but failed in these two examples..! thanks! – byc Oct 29 '18 at 17:47
  • These sentences seem to lemma correctly with your original clean_corpus function. But don't lemma correctly with the code in your earlier comment. I checked with the udpipe package to see how the lemma's would be created and they look the same as using `lemmatize_strings`. So now the question is why in your code it fails at some points when it shouldn't. – phiver Oct 30 '18 at 09:47
  • Thanks so much for looking this up! So I looked through all the failed documents - Among these failed documents, `clean_corpus()` managed to clean about half of the documents (but still wouldn't recognise as text by `LDA()`), and the other half showed up as 'na' which they weren't. I've tried adding wrapper `content_transformer()` within the `tm_map()` and it produced the same result. Converting the original dataframe into corpus was fine and all content showed up, so I'm pretty confident it was the `clean_corpus()`, or applying `tm_map()` to the corpus failed some of the documents. – byc Oct 30 '18 at 17:13
  • Are there any solutions to get around this, or am I better off use other R packages? Thanks! – byc Oct 30 '18 at 17:13

1 Answer


After more digging, and inspired by the discussion here, I think (correct me if I'm wrong) the issue I raised was caused by an actual bug in the tm package. Converting my dataframe with VCorpus() instead of Corpus(), and adding the content_transformer() wrapper to every cleaning step, lets me lemmatise all documents and apply DocumentTermMatrix() to the clean corpus without any errors. LDA() doesn't throw any errors either. Without the content_transformer() wrapper, my VCorpus() object comes back as a plain list instead of a corpus structure after cleaning.

For future reference, I'm using tm version 0.7-3.

library(tm)
library(topicmodels)
library(textstem)


df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2", 
    "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"), 
  text = c("He do subjects prepared bachelor juvenile ye oh.", 
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "Fond his say old meet cold find come whom. ",
    "Wonder matter now can estate esteem assure fat roused.",
    ".Am performed on existence as discourse is.", 
    "Moment led family sooner cannot her window pulled any.",
    "Why resolution one motionless you him thoroughly.", 
    "Why resolution one motionless you him thoroughly.")     
)


clean_corpus <- function(corpus){
                  corpus <- tm_map(corpus, content_transformer(stripWhitespace))
                  corpus <- tm_map(corpus, content_transformer(removePunctuation))
                  corpus <- tm_map(corpus, content_transformer(tolower))
                  corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
                  return(corpus)
                }

df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
Corpus <- VCorpus(DataframeSource(df), readerControl = list(reader = reader(DataframeSource(df)), language = "en"))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)
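As a quick sanity check before calling LDA() again (a small helper sketch, not from the original post), you can confirm the cleaned DTM has no empty rows:

```r
# Returns TRUE when every row of a document-term matrix has at least
# one non-zero entry, i.e. the exact condition LDA() complains about.
check_no_empty_docs <- function(dtm) {
  rowTotals <- apply(dtm, 1, sum)
  all(rowTotals > 0)
}

# e.g. stopifnot(check_no_empty_docs(Corpus_clean_dtm))
```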