
I have implemented finding documents similar to a particular document using an LDA model (with Gensim). Next, I want to handle multiple input documents: how do I get similar documents based on several documents provided as input?

I implemented LDA using this link

Sample code for a single query:

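from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda")  # result of running online LDA (training)

# Build and save a similarity index over the corpus in topic space
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

# Convert the query document to a bag-of-words vector, then to topic space
docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

# Rank corpus documents by similarity to the query, most similar first
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)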

Now, if I have more than one query document, how do I implement this?

Abhishek Sachan

2 Answers


You can use lda.update(corpus_new) to update the existing LDA model with additional documents.

For more detail, see https://radimrehurek.com/gensim/models/ldamodel.html

  • I asked about evaluating multiple documents, not about updating the model. – Abhishek Sachan May 29 '17 at 08:25
  • @abhishek could you please elaborate on your question? If you have multiple documents and want to provide them as input at once, you can stream the inputs using a generator. update is used when you need to add documents after training a model. Please see the link I provided in the answer above. Hope it helps. – Prakhar Pratyush May 30 '17 at 08:53

I think what you are looking for is this piece of code.

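import numpy as np
import pandas as pd

# texts is the new data: a list of tokenized documents
newData = [dictionary.doc2bow(text) for text in texts]
newCorpus = lda[newData]  # the new documents in LDA topic space

# One row of similarities per new document
sims = []
for doc_sims in index[newCorpus]:
    sims.append(doc_sims)

# Transpose so each column holds one new document's similarity to every document in the original corpus
sims = pd.DataFrame(np.array(sims)).transpose()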
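If a single ranking for the whole group of query documents is wanted, one possible approach (an assumption on my part, not something Gensim prescribes) is to average the per-query similarity rows and sort the result. The sketch below uses plain Python lists; the function name combine_similarities is hypothetical, and sim_rows stands for the rows collected from index[newCorpus] above.

```python
def combine_similarities(sim_rows):
    """Average per-query similarity rows into one ranking.

    sim_rows: one sequence per query document, each holding that query's
    similarity to every document in the original corpus.
    Returns (corpus_doc_index, mean_similarity) pairs, most similar first.
    """
    if not sim_rows:
        return []
    n_queries = float(len(sim_rows))
    n_docs = len(sim_rows[0])
    # Mean similarity of each corpus document across all query documents
    averaged = [sum(row[j] for row in sim_rows) / n_queries
                for j in range(n_docs)]
    return sorted(enumerate(averaged), key=lambda item: -item[1])


# Toy usage: two query documents scored against three corpus documents
ranked = combine_similarities([[0.2, 0.9, 0.4],
                               [0.6, 0.7, 0.1]])
print(ranked)
```

Averaging treats every query document equally. Taking the maximum per corpus document instead would favor documents that match any one of the queries strongly.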

However, cosine similarity is not the best way to measure similarity with an LDA model, because LDA represents each document as a probability distribution over topics. Look for an implementation of the Jensen-Shannon distance. I found this code for it but couldn't make it work in my case: Lda similarity with Jensen Shanon Distance
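For reference, the Jensen-Shannon distance mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the linked question. It assumes topic vectors in Gensim's sparse form, a list of (topic_id, probability) pairs as returned by lda[vec_bow], and renormalizes them because low-probability topics may be dropped from that output. The helper names to_dense, jensen_shannon_distance and rank_by_js are hypothetical.

```python
import math


def to_dense(sparse_vec, num_topics):
    """Expand [(topic_id, prob), ...] into a dense list that sums to 1."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = float(prob)
    total = sum(dense)
    return [p / total for p in dense] if total > 0 else dense


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2), in [0, 1]."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    js_div = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(js_div, 0.0))  # guard against tiny negative rounding


def rank_by_js(query_vec, corpus_vecs, num_topics):
    """Return (doc_index, distance) pairs, closest document first."""
    query = to_dense(query_vec, num_topics)
    scored = [(i, jensen_shannon_distance(query, to_dense(vec, num_topics)))
              for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda item: item[1])


# Toy usage with two topics and three corpus documents
query = [(0, 0.9), (1, 0.1)]
docs = [[(1, 1.0)], [(0, 0.9), (1, 0.1)], [(0, 0.5), (1, 0.5)]]
print(rank_by_js(query, docs, num_topics=2))
```

Unlike the cosine similarities above, this is a distance, so smaller values mean more similar documents and the sort is ascending.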

Armin Alibasic