
I have two documents, for example:

Doc1 = {'python','numpy','machine learning'}
Doc2 = {'python','pandas','tensorflow','svm','regression','R'}

And I also know the similarity (correlation) of each pair of words, e.g.

Sim('python','python') = 1
Sim('python','pandas') = 0.8
Sim('numpy', 'R') = 0.1

What is the best way to measure the similarity of the two documents?

It seems that the traditional Jaccard distance and cosine distance are not good metrics in this situation.

ken wang
  • What is the datatype of your documents? String or list? – Sociopath Aug 30 '18 at 06:54
  • @AkshayNevrekar You can just consider the document as a set of strings, as mentioned in the question; duplicates don't matter in my situation – ken wang Aug 30 '18 at 06:58
  • This is way too broad. You might want to try https://cs.stackexchange.com/ for this type of question, which is not really a Python question. – kabanus Aug 30 '18 at 06:59
  • @kabanus Thanks for the reminder, I will repost my question on Stack Exchange – ken wang Aug 30 '18 at 07:02
  • @Ken.W No problem. Do not forget to delete it here once copied as to not create a cross site duplicate. – kabanus Aug 30 '18 at 07:03
  • Do you have word vectors for the individual words? – grshankar Aug 30 '18 at 07:53
  • @grshankar No, I have tried using word vectors to calculate the similarity of two words, but the results were not as good as expected. So I used another approach to define the similarity of two words – ken wang Aug 30 '18 at 12:00
  • what approach did you use for word similarity ? – grshankar Aug 30 '18 at 12:09
  • @grshankar e.g Sim('python','pandas') = co-appearance('python','pandas') in corpus / appearance('pandas') in corpus – ken wang Aug 30 '18 at 12:20
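The co-occurrence-based word similarity described in the last comment can be sketched as follows. This is only an assumed interpretation: the corpus is hypothetical, and appearances are counted at the document level.

```python
def word_sim(w1, w2, corpus):
    """Sim(w1, w2) = co-appearance(w1, w2) / appearance(w2),
    counting at the document level (an assumed interpretation)."""
    appearance = sum(1 for doc in corpus if w2 in doc)
    co_appearance = sum(1 for doc in corpus if w1 in doc and w2 in doc)
    return co_appearance / appearance if appearance else 0.0

# Hypothetical corpus of documents, each a set of words
corpus = [
    {'python', 'pandas', 'numpy'},
    {'python', 'pandas'},
    {'python', 'R'},
    {'pandas'},
]

# 'pandas' appears in 3 documents, co-appears with 'python' in 2
score = word_sim('python', 'pandas', corpus)  # 2/3
```

Note that this measure is asymmetric: `word_sim('python', 'pandas', corpus)` generally differs from `word_sim('pandas', 'python', corpus)`, since each normalizes by a different word's appearance count.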

1 Answer


I like a book by Peter Christen on this issue.

Here he describes the Monge-Elkan similarity measure between two sets of strings: for each word in the first set, take the maximum similarity to any word in the second set, then sum these maxima and divide by the number of elements in the first set. You can see its description on page 30 here.
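A minimal sketch of Monge-Elkan in Python, using the documents from the question. The `sim` function and its similarity table are assumptions built from the example pairs in the question; unknown pairs fall back to exact match.

```python
def monge_elkan(doc1, doc2, sim):
    """Monge-Elkan similarity: for each word in doc1, take the maximum
    similarity to any word in doc2, then average these maxima over doc1."""
    if not doc1:
        return 0.0
    return sum(max(sim(a, b) for b in doc2) for a in doc1) / len(doc1)

# Hypothetical word-similarity table based on the question's examples;
# unlisted pairs default to 1.0 if the words are equal, else 0.0.
pair_sim = {
    frozenset({'python', 'pandas'}): 0.8,
    frozenset({'numpy', 'R'}): 0.1,
}

def sim(a, b):
    if a == b:
        return 1.0
    return pair_sim.get(frozenset({a, b}), 0.0)

doc1 = {'python', 'numpy', 'machine learning'}
doc2 = {'python', 'pandas', 'tensorflow', 'svm', 'regression', 'R'}

# ('python' -> 1.0) + ('numpy' -> 0.1) + ('machine learning' -> 0.0),
# divided by |doc1| = 3, gives 1.1 / 3 ≈ 0.367
score = monge_elkan(doc1, doc2, sim)
```

Note that Monge-Elkan is asymmetric: `monge_elkan(doc1, doc2, sim)` and `monge_elkan(doc2, doc1, sim)` generally differ. A common workaround is to average the two directions if a symmetric document similarity is needed.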

Denis Gordeev