How to define the optimal number of topics (k)?

Question

I want to know what the best number of topics (k) is to feed to gensim for LDA. I found an answer on StackOverflow, but when I ran its code I got the error shown below.

Here is the question whose answer suggests the way to find the optimal number of topics:

What is the best way to obtain the optimal number of topics for a LDA-Model using Gensim?

# import modules 

import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

# make models with n k

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

# Considering 1-15 topics, as the last is cut off
num_topics = list(range(16)[1:])
num_keywords = 15

LDA_models = {}
LDA_topics = {}
for i in num_topics:
    LDA_models[i] = LdaModel(corpus=bow_corpus,
                             id2word=dirichlet_dict,
                             num_topics=i,
                             update_every=1,
                             chunksize=len(bow_corpus),
                             passes=20,
                             alpha='auto',
                             random_state=42)

    shown_topics = LDA_models[i].show_topics(num_topics=num_topics, 
                                             num_words=num_keywords,
                                             formatted=False)
    LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]

When I try to run the code, I get this error:

-> 1145         if num_topics < 0 or num_topics >= self.num_topics:
   1146             num_topics = self.num_topics
   1147             chosen_topics = range(num_topics)

TypeError: '<' not supported between instances of 'list' and 'int'

score 2 · Accepted Answer · answered Nov 08 '20 at 17:22


This line:

shown_topics = LDA_models[i].show_topics(num_topics=num_topics

should be:

shown_topics = LDA_models[i].show_topics(num_topics=i
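The error can be reproduced without gensim. A minimal sketch: the check shown in the traceback compares num_topics with an integer, and that comparison fails when num_topics is the whole list rather than the loop variable.

```python
# The list of candidate k values, as defined in the question
num_topics = list(range(16)[1:])

try:
    # Same comparison as in the traceback: list < int
    num_topics < 0
except TypeError as err:
    message = str(err)
    print(message)

# The loop variable is an int, so the same comparison works
for i in num_topics:
    assert not (i < 0)
```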

Arguably, this happened because of poor variable naming: num_topics holds a list, while the keyword argument of the same name expects an int. It could be avoided by replacing num_topics = list(range(16)[1:]) and the subsequent loop with:

max_topics = 15
for num_topics in range(1, max_topics+1):
    # use num_topics instead of i in the loop

This would eliminate the possible confusion.

Marat

Can you walk me through how he computed the coherence, please? I've fixed the first part as you recommended, but I'm still getting errors when I run the coherence code. Many thanks. – Mohamed Hachaichi Nov 08 '20 at 22:05
I did not compute coherence; I only pointed out the problem in the question. In my personal experience, optimizing the number of topics like this doesn't make much sense. Target functions tend to be very flat, so the choice becomes very sensitive to irregularities in the training data. A reasonable manual choice typically performs better. – Marat Nov 08 '20 at 22:08
