
I use the following code to do topic modeling on my documents:

from sklearn.feature_extraction.text import TfidfVectorizer

# `tokenize` is a custom tokenizer function defined elsewhere in my code
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, max_df=0.85, min_df=3, ngram_range=(1, 5))

tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()


from sklearn.decomposition import NMF

no_topics = 50

# %time is an IPython/Jupyter magic that reports how long the fit takes
%time nmf = NMF(n_components=no_topics, random_state=11, init='nndsvd').fit(tfidf)
topic_pr = nmf.transform(tfidf)

I thought topic_pr gives me the probability distribution of the topics for each document. In other words, I expected the numbers in the output (topic_pr) to be the probabilities that the document in row X belongs to each of the 50 topics in the model. But the numbers do not add up to 1. Are these really probabilities? If not, is there a way to convert them to probabilities?

Thanks

1 Answer


NMF returns a non-negative factorization; it doesn't have anything to do with probabilities (to the best of my knowledge). If you just want probabilities, you can transform the output of NMF with an L1 normalization:

probs = topic_pr / topic_pr.sum(axis=1, keepdims=True)

This assumes that topic_pr is a non-negative matrix, which is true in your case.
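
Equivalently, a small sketch (not part of the original answer): sklearn's normalize helper performs the same L1 row scaling, and it leaves any all-zero rows at zero instead of producing NaNs:

import numpy as np
from sklearn.preprocessing import normalize

# L1-normalize each row so every document's topic weights sum to 1.
probs = normalize(topic_pr, norm='l1', axis=1)
print(np.allclose(probs.sum(axis=1), 1.0))  # True when no row is all zeros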


EDIT: Apparently there is a probabilistic version of NMF.

Quoting sklearn's documentation:

Non-negative Matrix Factorization is applied with two different objective functions: the Frobenius norm, and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing.

To apply the latter, which seems to be what you need, use the code from the same link. Note that LatentDirichletAllocation's fit_transform already returns a per-document topic distribution, so each row sums to 1:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5)
topic_pr = lda.fit_transform(tfidf)
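
As an aside (not part of the original answer): sklearn's topic-extraction example fits LDA on raw term counts rather than tf-idf weights, since LDA's generative model assumes count data. A minimal sketch of that variant, reusing the names from the question:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA models raw term counts, so use CountVectorizer instead of TfidfVectorizer.
tf_vectorizer = CountVectorizer(tokenizer=tokenize, max_df=0.85, min_df=3)
tf = tf_vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, random_state=11)
topic_pr = lda.fit_transform(tf)  # each row sums to 1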
  • thanks for the help. I tried `nmf = NMF(n_components=no_topics, random_state=1, beta_loss='kullback-leibler', solver='mu', alpha=.1, l1_ratio=.5).fit(tfidf)` but the outcomes still don't add up to 1. Did I do anything wrong? – Monica Muller Oct 10 '17 at 15:27
  • Did you try using the last two lines of my answer instead of NMF? – Imanol Luengo Oct 10 '17 at 16:03
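
Following up on the comment thread, a sketch (parameters assumed, not from the original exchange): NMF with the generalized Kullback-Leibler loss needs solver='mu', and even then its output is an unnormalized non-negative factorization, so the L1 normalization from the answer is still required:

from sklearn.decomposition import NMF

# KL-divergence NMF (the objective equivalent to PLSI) requires the 'mu' solver.
nmf_kl = NMF(n_components=no_topics, random_state=11,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000)
topic_pr = nmf_kl.fit_transform(tfidf)

# Rows are non-negative but do not sum to 1; normalize them explicitly.
probs = topic_pr / topic_pr.sum(axis=1, keepdims=True)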