
I have started using Galago for document retrieval. I want to cluster documents (initially, the documents retrieved by any model) using LDA. I would prefer a Java-based implementation that I can integrate into my code alongside Galago. I'd appreciate it if you could tell me which open-source implementation of LDA is most suitable for this purpose.

Thank you in advance for your help!

Magen

1 Answer


There's a fast algorithm for LDA from this paper:

S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, M. Zhu. A Practical Algorithm for Topic Modeling with Provable Guarantees. 30th International Conference on Machine Learning (ICML), 2013.

One of the authors (D. Mimno) has a Java implementation on GitHub: https://github.com/mimno/anchor

I've tried this implementation briefly and got good results quickly. As with all LDA/topic modeling, choosing the right number of topics can be challenging.

John Foley
Hi John, thank you for your help. I have just one question: in the output file of `train-anchor` specified by `--topics-file`, are the probabilities p(topic | word) * p(word)? I ask because the manual says p(topic | word), but in the code I found `wordProb * weights[topic]`. Thanks again! – Magen May 13 '16 at 07:50
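
For reference, here is how the two quantities in that comment relate. This is only a sketch under an assumption taken from the snippet quoted in the comment and not checked against the repository: that `weights[topic]` holds p(topic | word) and `wordProb` holds p(word). If so, the product is the joint probability of word and topic by the product rule, and normalizing it over the vocabulary would give the per-topic word distribution:

```latex
% Assumption: weights[topic] = p(t | w) and wordProb = p(w), as suggested by the comment.
\begin{align}
  p(t \mid w)\, p(w) &= p(w, t) \\
  p(w \mid t) &= \frac{p(t \mid w)\, p(w)}{\sum_{w'} p(t \mid w')\, p(w')}
\end{align}
```

Under that assumption, the written values would differ from p(topic | word) only by the per-word factor p(word), so dividing each value by p(word) would recover what the manual describes.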