Gensim LDA: coherence values not reproducible between runs

Question

I used the code from https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ to compute topic coherence for a dataset. When I rerun this code with the same number of topics, I get different values on every run. For example, with the number of topics set to 10, two runs gave the following values:

First run, number of topics = 10:

Coherence Score CV_1: 0.31230269562327095

Coherence Score UMASS_1: -3.3065236823786064

Second run, number of topics = 10:

Coherence Score CV_2: 0.277016662550274

Coherence Score UMASS_2: -3.6146150653617743

What is the reason? If the results are this unstable, how can we trust this library? The highest coherence value changed as well.

score 8 · Answer 1 · answered Apr 14 '19 at 14:36

TL;DR: coherence is not "stable" (i.e. reproducible between runs) in this case because of fundamental LDA properties. You can make LDA reproducible by setting random seeds and PYTHONHASHSEED=0. You can also take other steps to improve your results.

Long Version:

This is not a bug, it's a feature.

It is less a question of trust in the library than of understanding the methods involved. The scikit-learn library also has an LDA implementation, and theirs will also give you different results on each run. By its very nature, LDA is a generative probabilistic method. Simplifying a little bit here, each time you use it, many Dirichlet distributions are generated, followed by inference steps. Both the distribution generation and the inference steps depend on random number generators. Random number generators, by definition, generate random stuff, so each model is slightly different, and calculating the coherence of these models gives you different results every time.
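To make the role of the random number generator concrete, here is a toy sketch using only the Python standard library. It is not gensim's internal code: the helper `sample_dirichlet` and the `alpha` values are made up for illustration. It draws a Dirichlet-distributed vector by normalising gamma draws, once from unseeded generators and once from generators seeded with the same number.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

alpha = [0.5, 0.5, 0.5, 0.5]  # hypothetical concentration parameters

# Unseeded generators: each one starts from a different internal state,
# so the two samples will almost certainly differ.
unseeded_a = sample_dirichlet(alpha, random.Random())
unseeded_b = sample_dirichlet(alpha, random.Random())

# Generators seeded with the same number replay the same sequence,
# so the two samples are identical.
seeded_a = sample_dirichlet(alpha, random.Random(42))
seeded_b = sample_dirichlet(alpha, random.Random(42))

print("unseeded:", unseeded_a, unseeded_b)
print("seeded:  ", seeded_a, seeded_b)
```

An LDA run chains many such draws, so an unseeded run ends up with a slightly different model each time, and therefore a different coherence score.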

But that doesn't mean the library is worthless. It is a very powerful library that is used by many companies (Amazon and Cisco, for example) and academics (NIH, countless researchers) - to quote from gensim's About page:

By now, Gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.

If that is what you want, gensim is the way to go. It is certainly not the only option (tmtoolkit and sklearn also have LDA), but it is a pretty good one. That being said, there are ways to ensure reproducibility between model runs.

Gensim Reproducibility

Set PYTHONHASHSEED=0

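From the Python documentation: "On Python 3.3 and greater, hash randomization is turned on by default." Setting PYTHONHASHSEED=0 in the environment before the interpreter starts disables it, so string hashes stay the same from one run to the next.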
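A small sketch of what this setting does, using only the standard library. The helper name `string_hash_in_fresh_interpreter` is made up for this example. Because the variable is read when the interpreter starts, the sketch launches fresh Python processes instead of changing `os.environ` in the running one.

```python
import os
import subprocess
import sys

def string_hash_in_fresh_interpreter(hash_seed):
    """Start a new Python process with PYTHONHASHSEED set and return hash('topic')."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = hash_seed
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('topic'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())

# With PYTHONHASHSEED=0, hash randomization is disabled: both runs agree.
fixed_a = string_hash_in_fresh_interpreter("0")
fixed_b = string_hash_in_fresh_interpreter("0")

# With PYTHONHASHSEED=random, each process picks its own seed,
# so these two values will almost certainly differ.
random_a = string_hash_in_fresh_interpreter("random")
random_b = string_hash_in_fresh_interpreter("random")

print("fixed: ", fixed_a, fixed_b)
print("random:", random_a, random_b)
```

In practice you would set the variable in the shell or job configuration that launches your training script, so that it is in place before Python starts.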

Use random_state in your model specification

As far as I know, all of the gensim models have a way of specifying the random seed to be used. Choose any fixed number you like instead of relying on the default (unseeded) behaviour, and use the same number for each rerun. This ensures that the same input into the random number generators always results in the same output (gensim ldamodel documentation).

Use ldamodel.save() and ldamodel.load() for model persistence

This is also a very useful, timesaving step that keeps you from having to re-run your models every time you start (very important for long-running models).

Optimize your models and data

This doesn't technically make your models perfectly reproducible, but even without the random seed settings, you will see your model perform better (at the cost of computation time) if you increase iterations or passes. Preprocessing also makes a big difference and is an art unto itself: do you choose to lemmatize or stem, and why? All of this can have important effects on the outputs and your interpretations.

Caveat: you must use one core only

Multicore methods (LdaMulticore and the distributed versions) are never 100% reproducible, because of the way the operating system handles multiprocessing.

Many thanks for your answer. Sorry for my late response. – Panda, Feb 19 '21 at 18:52
