
I have two sentences in Python that represent sets of words a user gives as input, as a query for an image retrieval program:

sentence1 = "dog is the"
sentence2 = "the dog is a very nice animal"

I have a set of images that have a description, so for example:

sentence3 = "the dog is running in your garden"

I want to retrieve all the images whose description is "very close" to the query entered by the user. This description score should be normalized between 0 and 1, since it is only one part of a more complex search that also considers geotagging and low-level image features.

Given that I create three sets using:

set_sentence1 = set(sentence1.split())
set_sentence2 = set(sentence2.split())
set_sentence3 = set(sentence3.split())

And compute the intersection between sets as:

intersection1 = set_sentence1.intersection(set_sentence3)
intersection2 = set_sentence2.intersection(set_sentence3)

How can I efficiently normalize the comparison?

I don't want to use Levenshtein distance, since I'm not interested in string similarity but in set similarity.

user601836
  • a value in the range [0,1], where 1 is the output if the sets are equal, and 0 if their intersection is of size 0. The point is that the strings may have different sizes – user601836 Sep 12 '12 at 07:59
  • @user601836, okay, but what numbers are you expecting in your examples? 3/7 and 3/7? – Minras Sep 12 '12 at 08:04
  • Can you explain the background of your task? The normalisation here can be done in dozens of ways. Your normalisation pattern must reflect your expectations. – Maksym Polshcha Sep 12 '12 at 08:06
  • I will edit the question – user601836 Sep 12 '12 at 08:06
  • Are you sure you need set similarity? This kind of task, if I understand it correctly, is more commonly handled by [treating texts as vectors](https://en.wikipedia.org/wiki/Vector_space_model) and using cosine similarity, e.g. http://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity – Fred Foo Sep 12 '12 at 09:02

3 Answers


Maybe a metric like:

Similarity1 = (1.0 + len(intersection1))/(1.0 + max(len(set_sentence1), len(set_sentence3)))
Similarity2 = (1.0 + len(intersection2))/(1.0 + max(len(set_sentence2), len(set_sentence3)))
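
For the example sentences in the question, this metric gives 0.5 for both comparisons; a quick self-contained check (reusing the question's variable names):

set_sentence1 = set("dog is the".split())
set_sentence2 = set("the dog is a very nice animal".split())
set_sentence3 = set("the dog is running in your garden".split())

intersection1 = set_sentence1.intersection(set_sentence3)
intersection2 = set_sentence2.intersection(set_sentence3)

Similarity1 = (1.0 + len(intersection1)) / (1.0 + max(len(set_sentence1), len(set_sentence3)))
Similarity2 = (1.0 + len(intersection2)) / (1.0 + max(len(set_sentence2), len(set_sentence3)))
print(Similarity1)  # (1 + 3) / (1 + 7) = 0.5
print(Similarity2)  # (1 + 3) / (1 + 7) = 0.5

Note that the +1 terms mean an empty intersection gives a small positive score rather than exactly 0.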
iniju

Have you tried difflib?

An example from the docs:

>>> import sys
>>> from difflib import context_diff
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
...     sys.stdout.write(line)  
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido
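
If you want a single number in [0, 1] rather than a diff, difflib.SequenceMatcher may be closer to what the question asks for. A minimal sketch on the question's word sets (sorted into lists, since SequenceMatcher compares sequences):

from difflib import SequenceMatcher

set_sentence1 = set("dog is the".split())
set_sentence3 = set("the dog is running in your garden".split())

# ratio() returns a float in [0, 1]: 0.0 when nothing matches, 1.0 for identical sequences
similarity = SequenceMatcher(None, sorted(set_sentence1), sorted(set_sentence3)).ratio()
print(similarity)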
Inbar Rose

We can try Jaccard similarity: len(A.intersection(B)) / len(A.union(B)). More info at https://en.wikipedia.org/wiki/Jaccard_index
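
A minimal self-contained sketch, using the word sets built in the question:

set_sentence1 = set("dog is the".split())
set_sentence3 = set("the dog is running in your garden".split())

intersection = set_sentence1 & set_sentence3
union = set_sentence1 | set_sentence3

# 1.0 when the sets are equal, 0.0 when they share no words
jaccard = len(intersection) / float(len(union))  # float() keeps the division correct on Python 2
print(jaccard)  # 3 shared words out of 7 distinct words -> ~0.43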

prtkp