I am using the LDA algorithm to cluster many documents into different topics. The LDA algorithm needs an input parameter: the number of topics. How could I determine this?

I am using the Reuters corpus to benchmark my solution, and it already comes with a known number of topics. Should I use the same number of topics when clustering the Reuters text, and then compare my clustering result to the Reuters labels?

But in production, how can I know the number of topics before I actually cluster by topic? It's a chicken-and-egg problem.

smwikipedia
  • The answer is MAGICAL!!! Actually, there is more to set than the number of topics: if you're using the original LDA, you also have the alpha and beta parameters. – alvas Jan 14 '14 at 09:29
  • There is no proper way to say that some number x is the right number of topics, so people end up using HDP (hierarchical Dirichlet process). http://metaoptimize.com/qa/questions/5221/automatically-selecting-the-number-of-topics-in-lda – alvas Jan 14 '14 at 09:33
  • See also http://link.springer.com/chapter/10.1007%2F978-3-642-13657-3_43 – alvas Jan 14 '14 at 09:34
  • possible duplicate of [how to determine the number of topics for LDA?](http://stackoverflow.com/questions/17421887/how-to-determine-the-number-of-topics-for-lda) – Chthonic Project Jan 16 '14 at 00:23
  • Have you looked into nonparametric LDA? – duhaime Apr 10 '15 at 21:13

1 Answer

One way to approach this is to run k-means on the documents first. The silhouette score (or an elbow curve, though I guess that requires manual inspection) gives you an estimate of the optimal number of clusters. You can then use that number as the number of topics.
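
A rough, dependency-free sketch of that idea follows. Everything in it is an assumption made for illustration: the 2-D `points` are toy stand-ins for real document vectors (e.g. TF-IDF rows), the helper names (`init_centers`, `kmeans`, `silhouette`, `best_k`) are hypothetical, and the deterministic farthest-first initialisation is a simplification of the random or k-means++ seeding a library would normally use.

```python
import math

def init_centers(points, k):
    """Farthest-first seeding: deterministic stand-in for random init."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point farthest from its nearest already-chosen center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons / a single cluster
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def best_k(points, k_values):
    """Score each candidate k and return the one with the highest silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Toy data: three well-separated groups standing in for document vectors.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]

best, scores = best_k(points, range(2, 6))
print("silhouette per k:", {k: round(s, 3) for k, s in scores.items()})
print("suggested number of topics:", best)
```

The value of `best` would then be passed to LDA as the number of topics. On real text the silhouette curve is usually much flatter than on this toy data, so it is probably best treated as a starting point rather than a definitive answer.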

Clock Slave