Given two documents, I wish to calculate the similarity between them. I already have measures for cosine distance, N-grams and tf-idf, based on this previously asked question.
I wish to know what further needs to be done with these measures to get a single document-to-document similarity score.
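For context, what I imagine the tf-idf route would look like (a minimal sketch using scikit-learn; doc1 and doc2 stand for the two documents as raw strings, and the vectorizer settings are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [doc1, doc2]                          # the two documents as raw strings
tfidf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = tfidf.fit_transform(docs)                # 2 x vocabulary sparse matrix
score = cosine_similarity(X[0], X[1])[0, 0]  # similarity in [0, 1]

Is a single cosine score over tf-idf vectors like this the intended way to combine these measures, or is there more to it?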
Also, I have trained a Word2Vec model, and then tried computing document similarities with the following code:
for i in range(len(Words)):
    print(i)  # progress indicator
    for k in range(len(Words)):
        net_sim = 0.0
        for j in range(len(Words.loc[i, 'A'])):
            sim = 0.0
            for l in range(len(Words.loc[k, 'A'])):
                # compute once, keep the best match for word j
                cur = model.similarity(Words.loc[i, 'A'][j], Words.loc[k, 'A'][l])
                if sim < cur:
                    sim = cur
            net_sim += sim
        # normalize by the number of words in document i
        Similarity.loc[i, k] = net_sim / len(Words.loc[i, 'A'])
For every word in a given document, I find the most similar word in the second document and sum these similarities; I then divide by the number of words to normalize the score to the range 0 to 1. Here, Words is a DataFrame holding the words of each document in a separate row (column 'A'), and model is a trained Word2Vec model. This process takes a lot of time, and I wish to optimize it, so I am looking for different approaches.
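One direction I have been considering (a sketch, not tested against my data): since the two inner loops only find, for each word of document i, its best cosine similarity against document k, the whole thing can be vectorized with numpy by stacking unit-normalized word vectors into matrices and doing a single matrix product. This assumes every word is present in the model's vocabulary, and doc_similarity is a name I made up:

import numpy as np

def doc_similarity(words_a, words_b, model):
    # Stack word vectors into (n_words x dim) matrices; model[w] works
    # in older gensim, newer versions use model.wv[w] instead.
    A = np.array([model[w] for w in words_a])
    B = np.array([model[w] for w in words_b])
    # L2-normalize the rows so dot products equal cosine similarities.
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    # All pairwise cosine similarities at once, then the best match
    # in words_b for each word of words_a, averaged over words_a.
    return A.dot(B.T).max(axis=1).mean()

for i in range(len(Words)):
    for k in range(len(Words)):
        Similarity.loc[i, k] = doc_similarity(Words.loc[i, 'A'],
                                              Words.loc[k, 'A'], model)

I am aware gensim also ships n_similarity for comparing two word sets directly, but as far as I can tell it compares the mean vectors of the two sets rather than taking best matches per word, so it is not quite the same measure. Is this vectorized route sensible, or is there a better-known approach?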