I have a list: list = ['item1', 'item2', 'item3', 'item4']
I want to compare every item with every other item for similarity.
If item2 and item3 are similar, the result should become list = ['item1', 'item2', 'item4']
Edit:
Sorry for my confusing question.
Each list item is a trigram (a tuple of three words). I want to remove similar items from the list.
list = [('very','beauty','place'),('very','good','place'),('another','trigram','item')]
I compute the Jaccard similarity for every pair of items in the list; if the Jaccard score of a pair is > 0.4, I call the two items similar. In this example, item1 and item2 are similar: they share 2 of 4 distinct words, so the score is 0.5. The final output I want is:
list = [('very','beauty','place'),('another','trigram','item')]
This is the function I use to calculate the Jaccard score:
    def compute_jaccard_index(set_1, set_2):
        n = len(set_1.intersection(set_2))
        return n / float(len(set_1) + len(set_2) - n)
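
To make the goal concrete, here is a minimal sketch of the filtering I have in mind. It assumes a greedy pass: items are visited in order, and an item is kept only if its Jaccard score against every item kept so far is at most the threshold, so the first item of each similar group survives. The names remove_similar, threshold and trigrams are made up for illustration (trigrams is used instead of list to avoid shadowing the built-in), and the guard against an empty union is my own addition.

```python
def compute_jaccard_index(set_1, set_2):
    # Size of the intersection divided by the size of the union.
    n = len(set_1.intersection(set_2))
    union = len(set_1) + len(set_2) - n
    # Assumption: two empty sets are treated as having score 0.0.
    return n / float(union) if union else 0.0


def remove_similar(items, threshold=0.4):
    # Greedy filter: keep an item only if it is not similar
    # (score > threshold) to any item that was already kept.
    kept = []
    for item in items:
        item_set = set(item)  # trigrams are tuples, so convert to a set
        if all(compute_jaccard_index(item_set, set(k)) <= threshold
               for k in kept):
            kept.append(item)
    return kept


trigrams = [('very', 'beauty', 'place'),
            ('very', 'good', 'place'),
            ('another', 'trigram', 'item')]
result = remove_similar(trigrams)
print(result)
```

This compares each item against the kept items only, so the result depends on the input order; a different rule for choosing which member of a similar pair survives would need a different approach.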