
I have two documents, for example:

Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}

And I also know the similarity (correlation) of each pair of words, e.g.

Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1

What is the best way to measure the similarity of the two documents?

It seems that the traditional Jaccard distance and cosine distance are not good metrics in this situation.

ken wang
  • What is the datatype of your documents? String or list? – Sociopath Aug 30 '18 at 06:54
  • @AkshayNevrekar You can just consider the document as a set of strings, as mentioned in the question; duplicates don't matter in my situation – ken wang Aug 30 '18 at 06:58
  • This is way too broad. You might want to try https://cs.stackexchange.com/ for this type of question, which is not really a Python question. – kabanus Aug 30 '18 at 06:59
  • @kabanus Thanks for the reminder, I will repost my question on Stack Exchange – ken wang Aug 30 '18 at 07:02
  • @Ken.W No problem. Do not forget to delete it here once copied as to not create a cross site duplicate. – kabanus Aug 30 '18 at 07:03
  • Do you have word vectors for the individual words? – grshankar Aug 30 '18 at 07:53
  • @grshankar No, I have tried using word vectors to calculate the similarity of two words, but the results were not as good as expected. So I used another approach to define the similarity of two words – ken wang Aug 30 '18 at 12:00
  • what approach did you use for word similarity ? – grshankar Aug 30 '18 at 12:09
  • @grshankar e.g Sim('python','pandas') = co-appearance('python','pandas') in corpus / appearance('pandas') in corpus – ken wang Aug 30 '18 at 12:20
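The co-occurrence-based word similarity described in the last comment can be sketched as follows. This is only an assumed interpretation: the corpus is hypothetical, and appearances are counted at the document level.

```python
def word_sim(w1, w2, corpus):
    """Sim(w1, w2) = co-appearance(w1, w2) / appearance(w2),
    counting at the document level (an assumed interpretation)."""
    appearance = sum(1 for doc in corpus if w2 in doc)
    co_appearance = sum(1 for doc in corpus if w1 in doc and w2 in doc)
    return co_appearance / appearance if appearance else 0.0

# Hypothetical corpus of documents, each a set of words
corpus = [
    {'python', 'pandas', 'numpy'},
    {'python', 'pandas'},
    {'python', 'R'},
    {'pandas'},
]

# 'pandas' appears in 3 documents, co-appears with 'python' in 2
score = word_sim('python', 'pandas', corpus)  # 2/3
```

Note that this measure is asymmetric: `word_sim('python', 'pandas', corpus)` generally differs from `word_sim('pandas', 'python', corpus)`, since each normalizes by a different word's appearance count.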

1 Answer


I like a book by Peter Christen on this issue.

Here he describes the Monge-Elkan similarity measure between two sets of strings: for each word in the first set, take the maximum similarity to any word in the second set, then sum these maxima and divide by the number of elements in the first set. You can see its description on page 30 here.
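A minimal sketch of Monge-Elkan in Python, using the documents from the question. The `sim` function and its similarity table are assumptions built from the example pairs in the question; unknown pairs fall back to exact match.

```python
def monge_elkan(doc1, doc2, sim):
    """Monge-Elkan similarity: for each word in doc1, take the maximum
    similarity to any word in doc2, then average these maxima over doc1."""
    if not doc1:
        return 0.0
    return sum(max(sim(a, b) for b in doc2) for a in doc1) / len(doc1)

# Hypothetical word-similarity table based on the question's examples;
# unlisted pairs default to 1.0 if the words are equal, else 0.0.
pair_sim = {
    frozenset({'python', 'pandas'}): 0.8,
    frozenset({'numpy', 'R'}): 0.1,
}

def sim(a, b):
    if a == b:
        return 1.0
    return pair_sim.get(frozenset({a, b}), 0.0)

doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}

# ('python' -> 1.0) + ('numpy' -> 0.1) + ('machine learning' -> 0.0),
# divided by |doc1| = 3, gives 1.1 / 3 ≈ 0.367
score = monge_elkan(doc1, doc2, sim)
```

Note that Monge-Elkan is asymmetric: `monge_elkan(doc1, doc2, sim)` and `monge_elkan(doc2, doc1, sim)` generally differ. A common workaround is to average the two directions if a symmetric document similarity is needed.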

Denis Gordeev