How to determine the number of topics in the LDA (Latent Dirichlet Allocation) algorithm for text clustering?

Question

I am using the LDA algorithm to cluster many documents into different topics. The LDA algorithm needs an input parameter: the number of topics. How could I determine this?

I am using the Reuters corpus to benchmark my solution, and the Reuters corpus already comes with topic labels. Should I pass in the same number of topics when I cluster the Reuters text, and then compare my clustering result against the Reuters labels?

But in production, how can I know the number of topics before I actually cluster the documents by topic? It's kind of a chicken-and-egg problem.

The answer is MAGICAL!!! Actually, there is more to set than the number-of-topics parameter: if you're using the original LDA, you also have the alpha and beta parameters. — alvas, Jan 14 '14 at 09:29
There is no proper way to say that some number x is the right number of topics, so people end up using HDP (hierarchical Dirichlet process). http://metaoptimize.com/qa/questions/5221/automatically-selecting-the-number-of-topics-in-lda — alvas, Jan 14 '14 at 09:33
See also http://link.springer.com/chapter/10.1007%2F978-3-642-13657-3_43 — alvas, Jan 14 '14 at 09:34
Possible duplicate of [how to determine the number of topics for LDA?](http://stackoverflow.com/questions/17421887/how-to-determine-the-number-of-topics-for-lda) — Chthonic Project, Jan 16 '14 at 00:23

score 1 · Answer 1 · answered Mar 01 '17 at 09:42

One way to approach this is through k-means. Using the silhouette score (or an elbow curve, though I guess that requires manual inspection) you can estimate the optimal number of clusters, and then use that number as the number of topics.
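A minimal sketch of that idea, assuming scikit-learn is available. The toy corpus, the candidate range of k, and the helper name `pick_k_by_silhouette` are made up for illustration; on a real corpus you would swap in your own documents and a wider range. It is a heuristic, since the best k for k-means on TF-IDF vectors is not guaranteed to be the best number of LDA topics.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(docs, k_values, random_state=0):
    """Hypothetical helper: return (best_k, scores), where scores maps
    each candidate k to the mean silhouette score of a k-means clustering."""
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Higher mean silhouette means tighter, better-separated clusters.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy corpus, invented for this example only.
docs = [
    "oil prices rise as crude output falls",
    "crude oil exports drop after pipeline shutdown",
    "opec agrees to cut oil production",
    "wheat harvest hit by drought in the midwest",
    "grain exports climb as corn yields improve",
    "farmers expect record wheat and corn crops",
    "central bank raises interest rates again",
    "bond yields jump after interest rate decision",
    "bank lending slows as rates climb",
]

best_k, scores = pick_k_by_silhouette(docs, range(2, 6))
print("silhouette by k:", scores)
print("chosen k:", best_k)

# Use the chosen k as the number of LDA topics (LDA works on raw counts).
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=best_k, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, best_k)
```

For the elbow variant you would record `km.inertia_` for each k instead and look for the bend in the curve by eye, which is the manual step mentioned above.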

Clock Slave

7,627
15
68
109

How to determine the number of topics in the LDA (Latent Dirichlet Allocation) alogrithm for text clustering?

1 Answers1