I have two sentences in Python that represent sets of words a user enters as a query in an image retrieval application:
sentence1 = "dog is the"
sentence2 = "the dog is a very nice animal"
I also have a set of images, each with a textual description, for example:
sentence3 = "the dog is running in your garden"
I want to retrieve all the images whose description is "very close" to the query the user entered. The score for this description matching must be normalized between 0 and 1, since it is only one component of a more complex ranking that also takes geotagging and low-level image features into account.
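Just to make the context concrete, the final ranking would combine this score with the other components along these lines (only a sketch; the weights and the other two scores are hypothetical placeholders, not code I already have):

# Hypothetical overall ranking; w_desc, w_geo, w_low and the other two
# scores are placeholders. description_score must be in [0, 1] so it is
# comparable with the other components.
def total_score(description_score, geotag_score, low_level_score,
                w_desc=0.5, w_geo=0.3, w_low=0.2):
    return (w_desc * description_score
            + w_geo * geotag_score
            + w_low * low_level_score)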
From these I build three sets of words:
set_sentence1 = set(sentence1.split())
set_sentence2 = set(sentence2.split())
set_sentence3 = set(sentence3.split())
and compute the intersections between each query set and the description set:
intersection1 = set_sentence1.intersection(set_sentence3)
intersection2 = set_sentence2.intersection(set_sentence3)
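For the example data above, both intersections contain exactly the same three words, even though intersection1 covers all of sentence1's words while intersection2 covers fewer than half of sentence2's. This is why a raw intersection count is not enough:

print(intersection1)  # {'dog', 'is', 'the'} -> 3 common words out of 3
print(intersection2)  # {'the', 'dog', 'is'} -> 3 common words out of 7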
How can I efficiently normalize this comparison?
I don't want to use Levenshtein distance, since I'm not interested in string similarity but in set similarity.
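One candidate I've thought about is dividing the intersection size by the size of the union (the Jaccard index), which always falls in [0, 1]. A minimal sketch, assuming the simple whitespace tokenization above:

# Jaccard index: |A & B| / |A | B|, always in [0, 1]
def jaccard(a, b):
    # define two empty sets as perfectly similar to avoid division by zero
    return len(a & b) / len(a | b) if (a or b) else 1.0

similarity1 = jaccard(set_sentence1, set_sentence3)  # 3/7  ~ 0.43
similarity2 = jaccard(set_sentence2, set_sentence3)  # 3/11 ~ 0.27

Is this the right normalization for my use case, or is there a more standard or more efficient measure for set similarity?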