I am running LDA on a small corpus of two documents (sentences) for testing purposes. The following code returns topic-term and document-topic distributions that are not at all reasonable given the input documents. Running exactly the same model in Python returns reasonable results. What is going wrong here?

library(topicmodels)
library(tm)

d1 <- "bank bank bank"
d2 <- "stock stock stock"

corpus <- Corpus(VectorSource(c(d1,d2)))

## fit LDA to the data
dtm <- DocumentTermMatrix(corpus)
ldafit <- LDA(dtm, k=2, method="Gibbs") 

## get posteriors
topicTerm <- t(posterior(ldafit)$terms)
docTopic <- posterior(ldafit)$topics
topicTerm
docTopic

> topicTerm
              1         2
bank  0.3114525 0.6885475
stock 0.6885475 0.3114525
> docTopic
          1         2
1 0.4963245 0.5036755
2 0.5036755 0.4963245

The results from Python are as follows:

>>> docTopic
array([[ 0.87100799,  0.12899201],
       [ 0.12916713,  0.87083287]])
>>> fit.print_topic(1)
u'0.821*"bank" + 0.179*"stock"'
>>> fit.print_topic(0)
u'0.824*"stock" + 0.176*"bank"'
  • Interesting question. I have used this package in a real setting and it produced good results. I'm pretty sure these stinky results have something to do with the tiny corpus you're using. Could you post the results from Python for comparison? – KenHBS Sep 13 '17 at 10:04
  • Sure, please see the updated post. – schimo Sep 13 '17 at 10:28
  • `gensim` and `topicmodels` use different methods: `gensim` uses variational inference, whereas `topicmodels` uses collapsed Gibbs sampling here. `topicmodels` also has a variational inference option, but it gives the same crappy results. I'm at a dead end too, sorry – KenHBS Sep 13 '17 at 10:35
  • But the results in R with topicmodels barely change when using 'VEM' as the estimation method (see the sketch below)... – schimo Sep 13 '17 at 12:43
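
For reference, a minimal sketch of the comparison discussed in the comments, fitting both estimation methods on the dtm from the question with the package defaults:

## sketch: fit the same dtm with both estimation methods
fit_gibbs <- LDA(dtm, k=2, method="Gibbs")   # collapsed Gibbs sampling
fit_vem   <- LDA(dtm, k=2, method="VEM")     # variational EM
posterior(fit_gibbs)$topics
posterior(fit_vem)$topics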

2 Answers

The author of the R package topicmodels, Bettina Grün, pointed out that this is due to the choice of the hyperparameter alpha.

LDA in R uses alpha = 50/k = 25 by default, while LDA in gensim (Python) uses alpha = 1/k = 0.5. A smaller alpha favors sparse document-topic distributions, i.e. each document contains a mixture of just a few topics. Hence, decreasing alpha in R yields very reasonable results:

ldafit <- LDA(dtm, k=2, method="Gibbs", control=list(alpha=0.5)) 

posterior(ldafit)$topics
#    1     2
# 1  0.125 0.875
# 2  0.875 0.125

posterior(ldafit)$terms
#   bank    stock
# 1 0.03125 0.96875
# 2 0.96875 0.03125
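
As a quick sanity check, you can inspect which alpha a fitted model actually used: in topicmodels it is stored in the alpha slot of the fitted S4 object. A sketch, reusing the dtm from the question:

## sketch: compare the priors behind the two fits
fit_default <- LDA(dtm, k=2, method="Gibbs")
fit_default@alpha
# [1] 25     (the 50/k default)

fit_small <- LDA(dtm, k=2, method="Gibbs", control=list(alpha=0.5))
fit_small@alpha
# [1] 0.5    (the gensim-style 1/k prior)
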
  • Great answer! For more information on the importance of priors in LDA, see this paper: *Rethinking LDA: Why Priors Matter*, http://dirichlet.net/pdf/wallach09rethinking.pdf – KenHBS Sep 14 '17 at 09:41

Try plotting the perplexity over the iterations and make sure it converges. The initial state also matters. (Both the document size and the sample size seem small here, though.)

– Nelly Kong
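
A minimal sketch of that check, assuming the dtm from the question. Instead of the perplexity itself you can monitor the log-likelihood trace, a closely related diagnostic: in topicmodels, setting the keep control to a positive integer saves the log-likelihood of every keep-th Gibbs iteration in the logLiks slot of the fitted object.

## sketch: monitor convergence of the Gibbs chain
fit <- LDA(dtm, k=2, method="Gibbs",
           control=list(keep=1, iter=2000))
plot(fit@logLiks, type="l",
     xlab="iteration", ylab="log-likelihood")  # the curve should flatten out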