34

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!

still_st

3 Answers

36

It depends on which similarity metric you want to use.

Cosine similarity is universally useful & built-in:

sim = gensim.matutils.cossim(vec_lda1, vec_lda2)
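For intuition, here is a hand-rolled sketch of what `cossim` computes on sparse `(topic_id, weight)` vectors; the two topic distributions below are made up for illustration:

```python
from math import sqrt

def sparse_cosine(v1, v2):
    """Cosine similarity between two sparse (id, weight) vectors."""
    d1, d2 = dict(v1), dict(v2)
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = sqrt(sum(w * w for w in d1.values()))
    norm2 = sqrt(sum(w * w for w in d2.values()))
    return dot / (norm1 * norm2)

# Hypothetical LDA topic distributions over three topics
vec_lda1 = [(0, 0.7), (1, 0.2), (2, 0.1)]
vec_lda2 = [(0, 0.6), (1, 0.3), (2, 0.1)]
print(sparse_cosine(vec_lda1, vec_lda2))  # ≈ 0.983
```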

Hellinger distance is useful for similarity between probability distributions (such as LDA topics):

import numpy as np
from gensim import matutils

# Convert the sparse (topic_id, probability) vectors to dense arrays
dense1 = matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = matutils.sparse2full(lda_vec2, lda.num_topics)
sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2)) ** 2).sum())
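The snippet above is the Hellinger formula written out by hand (newer Gensim versions also ship it as `gensim.matutils.hellinger`, as noted in the comments). A self-contained NumPy sketch, with made-up dense distributions standing in for the `sparse2full` output:

```python
import numpy as np

def hellinger(dense1, dense2):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2)) ** 2).sum())

# Hypothetical dense topic distributions (each sums to 1)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])

print(hellinger(p, q))  # small value: the distributions are similar
print(hellinger(p, p))  # 0.0: identical distributions
```

Unlike cosine similarity, this is a distance: 0 means identical distributions, and larger values mean more dissimilar.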
Radim
  • what is the variable you specify as `lda_vec1`? when I use `lda[corpus[i]]`, I just get the top 3 or 4 topics contributing to document `i`, with the rest of the topic weights being 0.0. however I know that LDA should produce a topic distribution for **all** topics for every document. is there some efficient way (maybe using a gensim `index`) to compare a query document to every other document in the corpus _using the Hellinger distance_? – PyRsquared Jun 08 '17 at 16:58
  • This is the same thing that is confusing me. Radim, your answer would be highly appreciated: what is lda_vec1? Is it lda[corpus[i]]? If so, how can we build a similarity matrix for the whole corpus vs. new documents? – Armin Alibasic May 27 '18 at 08:20
  • I think you need to carry all the probability values of each topic by setting the minimum_probability threshold to 0. After that one can start building the index. – Gaurav Koradiya Aug 13 '20 at 11:12
  • Gensim has a built-in function for computing the Hellinger distance: https://radimrehurek.com/gensim/matutils.html#gensim.matutils.hellinger – Befeepilf Oct 09 '20 at 17:08
  • I know this is quite late but I found an example of what Radim meant by `lda_vec1` in this [chat](https://groups.google.com/g/gensim/c/BVu5-pD6910/m/VdT730HnFgAJ). – yudhiesh Feb 26 '21 at 13:40
26

I don't know if this will help, but I managed to get successful results on document matching and similarity when using the actual document as a query.

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda")  # result from running online LDA (training)

# Build a similarity index over the LDA representation of the training corpus
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

# Query with one of the documents
docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

Each entry in sims pairs a document number from the corpus with its similarity score to the query document, so the score you want is the second element of every tuple.
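To make the shape of `sims` concrete, here is a toy run of the same `sorted(enumerate(...))` idiom on made-up scores:

```python
# Hypothetical raw similarity scores for a four-document corpus,
# in corpus order, as returned by index[vec_lda]
scores = [0.12, 0.98, 0.45, 0.67]

# Pair each score with its document number, then sort by score descending
ranked = sorted(enumerate(scores), key=lambda item: -item[1])
print(ranked)  # [(1, 0.98), (3, 0.67), (2, 0.45), (0, 0.12)]
```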

Palisand
  • Thanks for your answer. If we want to find similar documents based on multiple queries, how should we proceed? – Abhishek Sachan May 24 '17 at 12:36
  • I'm using this approach to calculate similarity and I'm getting many similarity scores of 1. What might be the problem? –  Sep 30 '18 at 08:19
  • In my opinion, this approach might not be right, because in LDA a document is a probability distribution over topics, so you might not get the expected results every time. With LSI this approach is right, because LSI represents a document in LSI space rather than yielding a probability distribution like LDA. One can look into the Jensen-Shannon and Kullback-Leibler divergences for comparing probability distributions. – Gaurav Koradiya Aug 13 '20 at 10:57
  • @GauravKoradiya Are you sure? It gives me pretty good results. A similar approach of LDA/LSI + MatrixSimilarity is discussed on [Gensim's Github](https://github.com/RaRe-Technologies/gensim/issues/2644) and Radim Rehurek doesn't seem to indicate it would be a wrong approach. Calculating Jensen_Shannon distance seems problematic, and I've never got it working well like this. – Banik Aug 22 '22 at 18:56
6

The provided answers are good, but they aren't very beginner-friendly. I want to start from training the LDA model and then calculate the cosine similarity.

Training model part:

docs = ["latent Dirichlet allocation (LDA) is a generative statistical model", 
        "each document is a mixture of a small number of topics",
        "each document may be viewed as a mixture of various topics"]

# Convert document to tokens
docs = [doc.split() for doc in docs]

# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Training the model
from gensim.models import LdaModel
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)
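As an aside, `doc2bow` simply counts token ids; a hand-rolled sketch of the same idea (not the Gensim implementation, and with a made-up vocabulary):

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Map tokens to sparse (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical token-to-id mapping
token2id = {"each": 0, "document": 1, "is": 2, "a": 3, "mixture": 4}
print(to_bow("each document is a mixture of topics".split(), token2id))
# [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```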

There are generally two ways to extract the probability assigned to each topic for a document. I provide both here:

# Preprocess the test documents the same way as the training documents
test_doc = ["LDA is an example of a topic model",
            "topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]

# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))

# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))

output:

#Method 1
0.8279631530869963

#Method 2
0.828066885140262

As you can see, both methods give essentially the same result; the difference is that the probabilities returned by the second method sometimes don't add up to one, as discussed here. For a large corpus, the probability vectors can be obtained by passing the whole corpus at once:

# Method 1
probability_vector = model.get_document_topics(test_corpus, minimum_probability=0)
# Method 2
probability_vector = model[test_corpus]

NOTE: The sum of the probabilities assigned to the topics of a document may come out slightly above or below 1. That is because of floating-point rounding errors.
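If exact unit mass matters downstream (for instance when feeding the vector into a probability-distribution distance), one defensive option (a sketch, not a Gensim API) is to renormalize the sparse vector:

```python
def renormalize(topic_vec):
    """Rescale a sparse (topic_id, prob) vector so the weights sum to 1."""
    total = sum(p for _, p in topic_vec)
    return [(t, p / total) for t, p in topic_vec]

# Hypothetical LDA output whose weights sum to slightly more than 1
vec = [(0, 0.5003), (1, 0.3001), (2, 0.2001)]
vec = renormalize(vec)
print(sum(p for _, p in vec))  # ~1.0, up to floating-point precision
```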

mrghofrani
  • this was helpful for me. Maybe you could offer some advice on my problem: https://stackoverflow.com/questions/63306812/gensim-for-similarities/. Specifically I would like to iterate your method for cosine similarity over rows of a dataframe (while the overall corpus is composed by the whole dataframe) – Mobeus Zoom Aug 07 '20 at 20:12
  • Good answer. Solved my problem. – Gaurav Koradiya Aug 13 '20 at 11:16