
I use the following code to do topic modeling on my documents:

from sklearn.feature_extraction.text import TfidfVectorizer

# `tokenize` is a custom tokenizer function defined elsewhere in my code
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, max_df=0.85, min_df=3, ngram_range=(1, 5))

tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()


from sklearn.decomposition import NMF

no_topics = 50

# %time is an IPython/Jupyter magic that reports how long the fit takes
%time nmf = NMF(n_components=no_topics, random_state=11, init='nndsvd').fit(tfidf)
topic_pr = nmf.transform(tfidf)

I thought topic_pr gives me the probability distribution of the topics for each document. In other words, I expected the numbers in the output (topic_pr) to be the probabilities that the document in row X belongs to each of the 50 topics in the model. But the numbers do not add up to 1. Are these really probabilities? If not, is there a way to convert them to probabilities?

Thanks

1 Answer


NMF returns a non-negative factorization; it doesn't have anything to do with probabilities (to the best of my knowledge). If you just want probabilities, you can transform the output of NMF with an L1 normalization:

probs = topic_pr / topic_pr.sum(axis=1, keepdims=True)

This assumes that topic_pr is a non-negative matrix, which is true in your case.
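
Equivalently, a small sketch (not part of the original answer): sklearn's normalize helper performs the same L1 row scaling, and it leaves any all-zero rows at zero instead of producing NaNs:

import numpy as np
from sklearn.preprocessing import normalize

# L1-normalize each row so every document's topic weights sum to 1.
probs = normalize(topic_pr, norm='l1', axis=1)
print(np.allclose(probs.sum(axis=1), 1.0))  # True when no row is all zeros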


EDIT: Apparently there is a probabilistic version of NMF.

Quoting sklearn's documentation:

Non-negative Matrix Factorization is applied with two different objective functions: the Frobenius norm, and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing.

To apply the latter, which seems to be what you need, use the code from the same link. Note that LatentDirichletAllocation's fit_transform already returns a per-document topic distribution, so each row sums to 1:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5)
topic_pr = lda.fit_transform(tfidf)
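
As an aside (not part of the original answer): sklearn's topic-extraction example fits LDA on raw term counts rather than tf-idf weights, since LDA's generative model assumes count data. A minimal sketch of that variant, reusing the names from the question:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA models raw term counts, so use CountVectorizer instead of TfidfVectorizer.
tf_vectorizer = CountVectorizer(tokenizer=tokenize, max_df=0.85, min_df=3)
tf = tf_vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, random_state=11)
topic_pr = lda.fit_transform(tf)  # each row sums to 1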
  • thanks for the help. I tried `nmf = NMF(n_components=no_topics, random_state=1, beta_loss='kullback-leibler', solver='mu', alpha=.1, l1_ratio=.5).fit(tfidf)` but the outcomes still don't add up to 1. Did I do anything wrong? – Monica Muller Oct 10 '17 at 15:27
  • Did you try using the last two lines of my answer instead of NMF? – Imanol Luengo Oct 10 '17 at 16:03
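
Following up on the comment thread, a sketch (parameters assumed, not from the original exchange): NMF with the generalized Kullback-Leibler loss needs solver='mu', and even then its output is an unnormalized non-negative factorization, so the L1 normalization from the answer is still required:

from sklearn.decomposition import NMF

# KL-divergence NMF (the objective equivalent to PLSI) requires the 'mu' solver.
nmf_kl = NMF(n_components=no_topics, random_state=11,
             beta_loss='kullback-leibler', solver='mu', max_iter=1000)
topic_pr = nmf_kl.fit_transform(tfidf)

# Rows are non-negative but do not sum to 1; normalize them explicitly.
probs = topic_pr / topic_pr.sum(axis=1, keepdims=True)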