
I've derived an LDA topic model using a toy corpus as follows:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

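from gensim import corpora
from gensim.models import LdaModel

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors used to train the model

# Map each token id back to its word
id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word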

I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. E.g.:

test_lda = LdaModel(corpus, num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system'.split())]  # doc2bow expects a list of tokens, not a string

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

However when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus, num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system'.split())]  # doc2bow expects a list of tokens, not a string
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability below some threshold (I observed 0.01, to be more specific) are omitted from the output.

I'm wondering if this behaviour is due to some aesthetic considerations? And how can I get the distribution of the probability mass residual over all other topics?

Thank you for your kind answer!

Moses Xu

2 Answers


I read the source, and it turns out that topics with probabilities below a threshold are left out of the output. The threshold defaults to 0.01.
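To see why the 100-topic output in the question shrinks to a single entry, here is a small sketch of the thresholding described above. It does not call gensim; `filter_topics` is a hypothetical helper, and the distribution is made up (one dominant topic, with the remaining mass assumed to be spread evenly over the other 99 topics).

```python
# Illustrative numbers only: one dominant topic plus 99 topics sharing the rest,
# loosely modelled on the 100-topic output in the question.
num_topics = 100
dominant_topic, dominant_prob = 73, 0.505
rest = (1.0 - dominant_prob) / (num_topics - 1)  # about 0.005 per remaining topic

topic_dist = [dominant_prob if t == dominant_topic else rest
              for t in range(num_topics)]


def filter_topics(dist, minimum_probability=0.01):
    """Hypothetical helper: keep (topic_id, probability) pairs at or above the threshold."""
    return [(t, p) for t, p in enumerate(dist) if p >= minimum_probability]


reported = filter_topics(topic_dist)
residual = 1.0 - sum(p for _, p in reported)  # mass hidden by the filter

print(reported)
print(round(residual, 3))
```

With 5 topics the leftover mass per topic is around 0.1, well above 0.01, which would explain why the small model reports every topic.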

Moses Xu

I realise this is an old question but in case someone stumbles upon it, here is a solution (the issue has actually been fixed in the current development branch with a minimum_probability parameter to LdaModel but maybe you're running an older version of gensim).

Define a new function (this is just copied from the source):

def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]

The function above does not filter topics by probability; it returns all of them. If you only need the values and not the (topic_id, value) tuples, return topic_dist directly instead of building the list comprehension (it will also be much faster).

Matti Lyra
  • Hi, is gamma the probability distribution over topics? Sorry if this sounds silly, I am not very familiar with the internals of LDA. I ask because the documentation reads: "Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights) for each document in the chunk." I figured that for inference gensim offers a generator (lda[corpus]). – Shashank Sep 08 '16 at 21:02
  • `gamma` is the unnormalised topic scores per document, `topic_dist` is the probability distribution. Yes, `gensim` offers a generator `lda[corpus]`; that generator uses `lda.inference` internally. As I say above, if you _do not_ need the `(topic_id, probability)` pairs then it's going to be faster to call `.inference` yourself. You may need to perform chunking if your corpus is very large and does not fit into memory; `lda[corpus]` does the chunking internally as well. – Matti Lyra Sep 13 '16 at 07:27
  • NB: Use the following to normalize the distribution for all documents, not only the first: `topic_dist = gamma / gamma.sum(axis=1)[:, None]` – aless80 Oct 21 '19 at 10:13
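The row-wise normalisation from the last comment can be checked with plain NumPy. This is only a sketch: the `gamma` values below are made up and stand in for the first return value of `lda.inference(chunk)`, assumed to have one row per document and one column per topic.

```python
import numpy as np

# Made-up gamma values: 2 documents (rows) x 3 topics (columns).
gamma = np.array([[2.0, 1.0, 1.0],
                  [0.5, 0.5, 4.0]])

# Divide each row by its own sum so every document's weights add up to 1.
# The [:, None] turns the row sums into a column so broadcasting works per row.
topic_dist = gamma / gamma.sum(axis=1)[:, None]

print(topic_dist)
print(topic_dist.sum(axis=1))
```

By contrast, `gamma[0] / sum(gamma[0])` normalises only the first document, which is enough when `inference` is called with a single bag-of-words vector as in the answer.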