
I am new to LDA (latent Dirichlet allocation) and I want to use it in my work. However, I have run into some problems.

In order to get the best performance, I want to estimate the best number of topics. After reading "Finding scientific topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).

My question is: what does "a series of" mean here?

sieste
Chelsea Wang

5 Answers


Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.

If you are looking for deeper analysis, this paper on HDP reports the advantages of HDP in determining the number of groups.
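
If it helps, here is a minimal sketch of running HDP with gensim's HdpModel; the answer does not name a library, so gensim, the toy data, and the variable names are my assumptions.

```python
# Minimal sketch: infer topics with a hierarchical Dirichlet process using gensim.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy data; replace with your own tokenized documents.
texts = [["human", "machine", "interface"], ["graph", "trees", "minors"],
         ["user", "interface", "system"]]

dictionary = Dictionary(texts)                       # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words corpus

hdp = HdpModel(corpus=corpus, id2word=dictionary)    # nonparametric: no fixed topic count

# Inspect the topics the model reports; their number can guide the choice of k
# for a subsequent plain LDA run.
for topic in hdp.print_topics(num_topics=10, num_words=5):
    print(topic)
```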

Chthonic Project

A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest coherence. That said, the highest coherence may not always fit the bill.

[plot: topic coherence for different numbers of topics]

See this topic modeling example.
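
Here is a rough sketch of that procedure with gensim; the library choice, the "c_v" coherence measure, and all parameter values are my assumptions rather than part of the answer.

```python
# Sketch: fit LDA for several candidate topic counts and compare topic coherence.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy data; replace with your own tokenized documents.
texts = [["human", "machine", "interface"], ["graph", "trees", "minors"],
         ["user", "interface", "system"], ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

scores = {}
for k in range(2, 11):                               # candidate numbers of topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=0, passes=10)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v")             # one common coherence measure
    scores[k] = cm.get_coherence()

print(scores)
best_k = max(scores, key=scores.get)                 # highest coherence; still eyeball the curve
print("best k by coherence:", best_k)
```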

Selva

First, some people use the harmonic mean to find the optimal number of topics, and I tried it as well, but the results were unsatisfactory. So my suggestion is: if you are using R, the package "ldatuning" will be useful. It provides four metrics for estimating the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for choosing the best topic model, although V-fold cross-validation can be a bit time consuming for large datasets. You can see "A heuristic approach to determine an appropriate number of topics in topic modeling".

Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
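
If you are not working in R, a minimal Python/gensim sketch of the held-out likelihood idea might look like the following; it uses a single train/test split rather than full V-fold cross-validation, and the library, toy data, and candidate values of k are my assumptions.

```python
# Sketch: compare candidate topic counts by held-out per-word likelihood
# (gensim's log_perplexity bound). A single split is shown; V-fold CV would
# repeat this per fold and average the results.
import random

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy data; replace with your own tokenized documents.
texts = [["human", "machine", "interface"], ["graph", "trees", "minors"],
         ["user", "interface", "system"], ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

random.seed(0)
random.shuffle(corpus)
split = max(1, int(0.75 * len(corpus)))
train, heldout = corpus[:split], corpus[split:]

for k in (2, 5, 10):                                 # candidate numbers of topics
    lda = LdaModel(corpus=train, id2word=dictionary, num_topics=k, random_state=0)
    # log_perplexity returns a per-word likelihood bound; higher (less negative) is better.
    print(k, lda.log_perplexity(heldout))
```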

Pooja

Let k = number of topics

There is no single best way, and I am not even sure whether there are any standard practices for this.

Method 1: Try out different values of k and select the one that has the largest likelihood.

Method 2: Instead of LDA, see if you can use HDP-LDA.

Method 3: If HDP-LDA is infeasible on your corpus (because of the corpus size), take a uniform sample of your corpus and run HDP-LDA on that, and take the value of k it gives you. For a small interval around this k, use Method 1 (a rough sketch follows below).
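
The following is a rough sketch of Method 3 with gensim; the sample size, the width of the interval, the weight threshold used to count HDP topics, and the toy data are all my assumptions.

```python
# Sketch of Method 3: run HDP on a uniform sample of the corpus to get a rough k,
# then search a small interval around it with plain LDA and held-out likelihood (Method 1).
import random

from gensim.corpora import Dictionary
from gensim.models import HdpModel, LdaModel

# Toy data; replace with your own tokenized documents.
texts = [["human", "machine", "interface"], ["graph", "trees", "minors"],
         ["user", "interface", "system"], ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
heldout = corpus[-1:]                                          # small held-out set for Method 1

random.seed(0)
sample = random.sample(corpus, k=min(len(corpus), 10_000))     # uniform sample of the corpus

hdp = HdpModel(corpus=sample, id2word=dictionary)
alpha, _ = hdp.hdp_to_lda()            # corpus-level topic weights of the (truncated) HDP
k0 = int((alpha > 0.01).sum())         # count non-negligible topics; the threshold is a guess

best_k = max(
    range(max(2, k0 - 5), k0 + 6),     # small interval around the HDP estimate
    key=lambda k: LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                           random_state=0).log_perplexity(heldout),
)
print("HDP estimate:", k0, "refined k:", best_k)
```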

Ashok Lathwal

Since I am working on the same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first train a word2vec model (e.g. using the word2vec package), then apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and finally use the number of clusters found as the number of topics in the LDA algorithm.

If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.

[flowchart from the paper]
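
For anyone who wants to try it, here is a rough Python sketch of the idea; it is not the paper's exact pipeline. I use gensim's Word2Vec instead of the R word2vec package and substitute scikit-learn's MeanShift (which also seeks density modes) for the density-peak clustering in densityClust, and the toy data and parameters are placeholders.

```python
# Sketch of the idea in Wang et al. (2019): embed words, find density modes in the
# embedding space, and use the number of modes as the topic count for LDA.
# MeanShift stands in here for density-peak clustering (densityClust in R).
from gensim.models import Word2Vec
from sklearn.cluster import MeanShift

# Toy data; replace with your own tokenized documents.
texts = [["human", "machine", "interface"], ["graph", "trees", "minors"],
         ["user", "interface", "system"], ["graph", "minors", "survey"]]

w2v = Word2Vec(sentences=texts, vector_size=50, min_count=1, seed=0)  # word embeddings
vectors = w2v.wv[w2v.wv.index_to_key]                                 # one row per vocabulary word

clustering = MeanShift().fit(vectors)                # mode-seeking clustering of the embeddings
num_topics = len(set(clustering.labels_))            # number of density modes found
print("suggested number of topics:", num_topics)

# num_topics would then be passed to LDA, e.g. LdaModel(..., num_topics=num_topics).
```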

Karsten W.
    LDA is a generative model, word2vec is not (it's just an embedding model), so the latter cannot render LDA obsolete. This approach replaces the need to specify number of topics in LDA with the need to specify number of features in word2vec. – kfmfe04 Jan 16 '22 at 20:20
  • What do you mean "specify the number of features in word2vec"? I believe it is necessary to specify whether a density peak is a cluster center or not, but otherwise no further assumptions are necessary. – Karsten W. Jan 18 '22 at 22:24
  • You need to pick the number of dimensions: typically, between 100 and 1000 - that's a parameter that impacts the quality of your embeddings (up to a point). – kfmfe04 Jan 30 '22 at 03:32