
(0, '0.707*"उत्तरपश्चिमी" + 0.707*"यूरोप" + -0.000*"बुद्ध" + -0.000*"जन्म" + ' '0.000*"बेल्जियम" + 0.000*"किंगडम" + 0.000*"नेपाल" + 0.000*"ऑफ़" + ' '-0.000*"युन" + -0.000*"स्थली"')]
Whereas the documentation says:
show_topics(num_topics=-1, num_words=10, log=False, formatted=True)
Return num_topics most significant topics (return all by default). For each topic, show num_words most significant words (10 words by default).

The topics are returned as a list – a list of strings if formatted is True, or a list of (word, probability) 2-tuples if False.

If log is True, also output this result to log.
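For illustration (the values below are made up, not from a real model), the two return shapes described by the documentation look like this:

```python
# formatted=True: each topic is a (topic_id, string) pair
topics_formatted = [(0, '0.707*"a" + 0.707*"b"')]

# formatted=False: each topic is a (topic_id, [(word, weight), ...]) pair,
# so the numeric contributions can be inspected directly
topics_raw = [(0, [("a", 0.707), ("b", 0.707)])]
```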

from pprint import pprint

from gensim import corpora, models
from gensim.parsing.preprocessing import strip_punctuation, strip_short
from nltk.tokenize import word_tokenize

# corpus: an iterable of raw document strings (defined elsewhere)
def preprocessing(corpus):
    for document in corpus:
        doc = strip_short(document, 3)
        doc = strip_punctuation(doc)
        yield word_tokenize(doc)

texts = preprocessing(corpus)
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, keep_n=25000)

# The generator above is exhausted by Dictionary(), so re-run preprocessing
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in preprocessing(corpus)]
tfidf = models.TfidfModel(doc_term_matrix)
corpus_tfidf = tfidf[doc_term_matrix]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary)
pprint(lsi.show_topics(num_topics=4, num_words=10))
[(0,
  '0.707*"उत्तरपश्चिमी" + 0.707*"यूरोप" + -0.000*"बुद्ध" + -0.000*"जन्म" + '
  '0.000*"बेल्जियम" + 0.000*"किंगडम" + 0.000*"नेपाल" + 0.000*"ऑफ़" + '
  '-0.000*"युन" + -0.000*"स्थली"'),
 (1,
  '0.577*"किंगडम" + 0.577*"बेल्जियम" + 0.577*"ऑफ़" + -0.000*"जन्म" + '
  '-0.000*"बुद्ध" + -0.000*"भगवान" + -0.000*"स्थित" + -0.000*"लुंबिनी" + '
  '-0.000*"उत्तरपश्चिमी" + -0.000*"यूरोप"'),
 (2,
  '0.354*"जन्म" + 0.354*"भगवान" + 0.354*"स्थित" + 0.354*"स्थली" + 0.354*"युन" '
  '+ 0.354*"बुद्ध" + 0.354*"लुंबिनी" + 0.354*"नेपाल" + 0.000*"उत्तरपश्चिमी" + '
  '0.000*"यूरोप"')]

1 Answer


Thanks for using SO.

show_topics gives you the most significant topics from the corpus. The values you see are the contribution (weight) of each word towards the topic. For example, "उत्तरपश्चिमी" and "यूरोप" each have a contribution of 0.707, while "बेल्जियम" has a contribution of 0.000 towards defining this topic.
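If you want to work with the contributions numerically rather than as strings, you can either call show_topics(formatted=False) or parse the formatted string yourself. A small parsing sketch (the regex and helper name here are my own, not part of gensim):

```python
import re

def parse_topic(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    # Matches terms like 0.707*"word" or -0.000*"word"
    return [(word, float(weight))
            for weight, word in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string)]

pairs = parse_topic('0.707*"उत्तरपश्चिमी" + 0.707*"यूरोप" + -0.000*"बुद्ध"')
print(pairs)  # [('उत्तरपश्चिमी', 0.707), ('यूरोप', 0.707), ('बुद्ध', -0.0)]
```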

When showing a word's contribution, the model displays the values with the greatest absolute magnitude, but because floating-point numbers close to 0 (say, -0.0000008) are rounded for display, they show up as -0.000.
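In other words, -0.000 is just the rounded display of a tiny negative number; Python's floating-point representation has a signed zero, and -0.0 compares equal to 0.0:

```python
# A tiny negative weight rounds to "-0.000" when formatted to 3 decimals
print(f"{-0.0000008:.3f}")  # -0.000

# Negative zero is an artifact of the float representation, not a different value
a, b = 0.0, -0.0
print(a == b)  # True
print(a * b)   # -0.0
```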

References: https://radimrehurek.com/gensim/models/lsimodel.html

Manish
    You could add further: this is an artifact of the underlying 'floating point' representation, which internally can have a positive or negative sign on `0.0`. So either may show up as the result of some calculations, but both mean the same thing and count as equivalent. For example, try `a=0.0; b=-0.0`, `print(a, b)`, `print(a==b)`, `print(a*b)`. – gojomo Sep 09 '20 at 23:07
  • Could you please elaborate what does the negative sign indicates – hrithik auchar Sep 10 '20 at 08:13
  • Nothing. It is an artifact of the internal representation. Try running the example code in my comment at a Python interpreter: you'll see that `0.0` and `-0.0` are equivalent. It is safe to ignore it, but if you're curious, there's more discussion at: https://stackoverflow.com/questions/4083401/negative-zero-in-python – gojomo Sep 10 '20 at 19:10