Step 1
I'm using R and the "topicmodels" package to build an LDA model from a corpus of about 4,500 documents. I do the usual pre-processing steps (stopword removal, cutting very low and very high term frequencies, lemmatization) and end up with a 100-topic model that I'm happy with. In fact, it's an almost perfect model for my needs.
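For reference, the pre-processing looks roughly like this. It's a simplified sketch: docs_raw, the preprocess_to_dtm helper and the frequency cut-offs are illustrative placeholders, not my exact pipeline.

library(tm)
library(textstem)  # lemmatize_strings()
library(slam)      # row_sums() on sparse document-term matrices

preprocess_to_dtm <- function(docs_raw) {
  corpus <- VCorpus(VectorSource(docs_raw))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
  corpus <- tm_map(corpus, stripWhitespace)
  # keep terms appearing in at least 5 and at most 2000 documents (illustrative cut-offs)
  DocumentTermMatrix(corpus, control = list(bounds = list(global = c(5, 2000))))
}

dtm_lemma <- preprocess_to_dtm(docs_raw)                  # docs_raw: the ~4,500 training documents
dtm_lemma <- dtm_lemma[which(row_sums(dtm_lemma) > 0), ]  # Gibbs sampling needs non-empty documents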
library(topicmodels)
justlda <- LDA(x = dtm_lemma, k = 100, method = "Gibbs", control = control_list_gibbs)
Step 2
I then pre-process a new corpus of 300 documents (unseen by the model) using exactly the same process as above, transform it into a document-term matrix, and use the "posterior" function of the same package to predict the topics of the new data. This corpus comes from the same authors and is very similar to the training set.
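Concretely, with the same placeholder pipeline sketched in step 1 (new_docs_raw standing in for the 300 new documents), step 2 amounts to:

# same pre-processing as the training data; the dtm_lemma name is reused for the new data's DTM
dtm_lemma <- preprocess_to_dtm(new_docs_raw)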
My problem
The predictions (posterior probabilities) I get are totally wrong. This is the code I'm using to get the posterior:
topics <- posterior(justlda, dtm_lemma, control = control_list_gibbs)$topics
- justlda is the model built on the full training corpus in step 1.
- dtm_lemma is the pre-processed document-term matrix of the new data.
- control is the list of LDA parameters (the same list for both calls).
I feel that not only are the predictions wrong, but the topic weights are also very low. Nothing comes out as a dominant topic. (For this 100-topic model, most topics come out around 0.08, and I'm lucky to get a 0.20 weight, and even that is for a topic that isn't relevant...)
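To give an idea, this is roughly how I'm looking at the weights (topics is the matrix returned above, one row per new document):

# distribution of each new document's single strongest topic weight
summary(apply(topics, 1, max))

# the top topics for the first new document, sorted by weight
head(sort(topics[1, ], decreasing = TRUE))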
I have less than a year of experience with NLP/LDA and the R language, so I feel I could be making a very amateur mistake somewhere that would explain the wrong predictions.
Are these kinds of results normal? What could I possibly be doing wrong?