
I am using scikit-learn's NMF and LDA sub-modules to analyze unlabeled text. I read the documentation, but I am not sure whether the transform functions in these modules (NMF and LDA) are the same as the posterior function in R's topicmodels package (please see Predicting LDA topics for new data). Basically, I am looking for a function that will let me predict the topics of a test set using a model trained on the training set. I first predicted topics on the entire dataset, then split the data into train and test sets, trained a model on the train set, and transformed the test set with that model. Although I did not expect identical results, comparing the topics from the two runs does not reassure me that the transform function serves the same purpose as the R function. I would appreciate your response.
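
For context, here is a minimal sketch of the workflow I am describing (the toy corpus and parameter values are placeholders, not my actual data):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# placeholder corpus standing in for my unlabeled documents
docs = ["first document text", "second document text", "third document text", "fourth document text"]
train_docs, test_docs = train_test_split(docs, test_size=0.5, random_state=0)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learn the vocabulary on the training set only
X_test = vectorizer.transform(test_docs)         # reuse that vocabulary for the test set

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)                                 # fit the topic model on the training set
test_doc_topics = lda.transform(X_test)          # is this the "posterior" step I am after?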

Thank you.

valearner

1 Answer


The call to transform on a LatentDirichletAllocation model returns an unnormalized document-topic distribution (in newer scikit-learn releases transform may already return normalized probabilities, in which case the extra step below is a harmless no-op). To get proper probabilities, you can simply normalize the result. Here is an example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np

# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train, test = dataset.data[:100], dataset.data[100:200]

# vectorize the features
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)

# train the model
lda = LatentDirichletAllocation(n_components=5)
lda.fit(X_train)

# predict topics for test data
# unnormalized doc-topic distribution
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = lda.transform(X_test)

# normalize the distribution (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized / np.sum(doc_topic_dist_unnormalized, axis=1, keepdims=True)

To find the top-ranking topic for each document, you can do something like:

doc_topic_dist.argmax(axis=1)
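
If you also want to see which words characterize each topic, you can inspect the fitted components. A minimal sketch, assuming a recent scikit-learn where the vocabulary is exposed via get_feature_names_out (older versions used get_feature_names):

feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # indices of the 5 largest weights for this topic
    top_words = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")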
Ryan Walker
  • Thank you Ryan. Something I was wondering about: the NMF model (and, I believe, LDA as well, at least in the lda module rather than sklearn) produces two matrices, W and H. Would it be OK to predict the test set by first computing X_test = tf_vectorizer.transform(test) and then X_test*H.T? – valearner Nov 16 '16 at 19:07
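
Regarding X_test*H.T: with scikit-learn's NMF the learned H is stored in nmf.components_, and nmf.transform(X_test) solves a small non-negative least-squares problem for the new W with H held fixed, which is generally not the same as the plain projection X_test*H.T. A minimal sketch, reusing tf_vectorizer, X_train, and test from the example above:

from sklearn.decomposition import NMF

# fit NMF on the training matrix; fit_transform returns W, components_ holds H
nmf = NMF(n_components=5, random_state=0)
W_train = nmf.fit_transform(X_train)
H = nmf.components_

# get W for unseen documents: minimizes ||X_test - W @ H|| with W >= 0 and H fixed
X_test = tf_vectorizer.transform(test)
W_test = nmf.transform(X_test)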