
Gensim has a tutorial explaining how, given a document/query string, to find which other documents are most similar to it, in descending order:

http://radimrehurek.com/gensim/tut3.html

It can also display the topics associated with an entire model:

How to print the LDA topics models from gensim? Python

But how do you find what topics are associated with a given document/query string? Ideally with some numeric similarity metric for each topic? I haven't been able to find anything on that.

rwallace
  • Could the topics be contained in the query string, or are they mutually exclusive? – Nathan McCoy Mar 24 '17 at 09:19
  • @NathanMcCoy Mutually exclusive; when gensim talks about topics, it doesn't mean the ordinary English sense of the word, it means a data structure consisting of a vector of words together with floating point weights. – rwallace Mar 24 '17 at 10:07

1 Answer


If you want to find the topic distribution of unseen documents, then you need to convert the document of interest into a bag-of-words representation:

from gensim import utils, models
from gensim.corpora import Dictionary

lda = models.LdaModel.load('saved_lda.model')          # load the saved model
dictionary = Dictionary.load('saved_dictionary.dict')  # load the saved dictionary

with open('document', 'r') as inp:       # read the file into a single string
    text = inp.read()
tkn_doc = utils.simple_preprocess(text)  # tokenize & filter the words
doc_bow = dictionary.doc2bow(tkn_doc)    # use the dictionary to create a bag of words
doc_vec = lda[doc_bow]  # topic probability distribution for the document of interest

From this code you get a sparse vector: a list of `(topic_id, probability)` pairs, where each probability says how strongly the document's words are associated with that topic in the model (topics whose probability falls below the model's `minimum_probability` are omitted). You can visualize the distribution by creating a bar graph with matplotlib.

import numpy as np
import matplotlib.pyplot as plt

y_axis = []
x_axis = []
for topic_id, prob in doc_vec:  # doc_vec is a list of (topic_id, probability) pairs
    x_axis.append(topic_id + 1)
    y_axis.append(prob)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, len(x_axis), 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, max(y_axis) + .01])
plt.xlim([0, len(x_axis) + 1])
plt.savefig(output_path)  # output_path: wherever you want the figure saved
plt.close()


If you want to see the top terms in each topic, you can print them with the model's `print_topics` (or `show_topic`) method. Referencing the graph, you can then look up the printed top words and see how the model interpreted the document. You can also measure the distance between the probability distribution vectors of two different documents using metrics such as Hellinger distance, Euclidean distance, Jensen-Shannon divergence, etc.

Kenneth Orton