I have two documents, for example:
Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}
I also know the similarity (correlation) of each pair of words, e.g.:
Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1
What is the best way to measure the similarity of the two documents?
It seems that the traditional Jaccard and cosine distances are not good metrics in this situation, since they only count exact word matches and ignore the pairwise word similarities.
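To make the concern concrete, here is a minimal sketch that computes both traditional measures on the two example documents. The function names `jaccard_similarity` and `binary_cosine_similarity` are my own, and the cosine version assumes each document is represented as a 0/1 bag-of-words vector. Neither function ever looks at the Sim values.

```python
import math

# The two example documents from the question, as sets of terms.
doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union (non-empty sets)."""
    return len(a & b) / len(a | b)


def binary_cosine_similarity(a, b):
    """Cosine similarity of the 0/1 bag-of-words vectors of two non-empty sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


print(jaccard_similarity(doc1, doc2))        # 0.125 (1 shared term out of 8 distinct terms)
print(binary_cosine_similarity(doc1, doc2))  # 1 / sqrt(3 * 6)
```

Only the exact match on 'python' contributes to either score. A related pair such as 'python' and 'pandas' (Sim = 0.8) is treated the same as a nearly unrelated pair such as 'numpy' and 'R' (Sim = 0.1).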