How can I calculate the Jaccard Similarity of two lists containing strings in Python?

Question

I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?

This thread shows how to calculate the Jaccard similarity between two strings; however, I want to apply it to two lists, where each element is one word (e.g., a username).

score 38 · Accepted Answer · edited Aug 12 '21 at 09:39

I ended up writing my own solution after all:

def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return float(intersection) / union
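
A usage sketch for the function above, assuming Python 3. The usernames are hypothetical, and so is the name `jaccard_similarity_safe`. The guard that returns 1.0 for two empty lists is a convention chosen here, not part of the original answer; without it, two empty lists raise ZeroDivisionError.

```python
def jaccard_similarity_safe(list1, list2):
    """Set-based Jaccard similarity of two lists of strings."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumption: two empty lists are treated as identical.
        return 1.0
    return len(set1 & set2) / len(union)


# Hypothetical usernames, for illustration only.
users_a = ["alice", "bob", "carol"]
users_b = ["bob", "carol", "dave", "erin"]

# 2 shared usernames out of 5 distinct ones
print(jaccard_similarity_safe(users_a, users_b))  # 0.4
```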

edited Aug 12 '21 at 09:39 by Mina Melek

answered Oct 30 '17 at 13:47 by Aventinus

The function will always return 0.0 – xyd Jul 27 '18 at 17:35
@xyd Works perfect for me. Can you please explain? – Aventinus Nov 12 '19 at 10:55
Worth noting this calculation is different than the answer by @w2bo as this one does not divide by the set length union. – Union find Dec 03 '19 at 21:14
This answer is wrong. For example, `jaccard_similarity([1], [0, 1])` -> `0.5` and `jaccard_similarity([1, 1], [0, 1, 1])` -> `0.25` however second one should be as similar or more similar than first one based on how you define the jaccard. – Muhammed Hasan Celik Jan 05 '21 at 18:34
The solution is simple and elegant, but not 100% correct. You should change the corresponding line to : `union = (len(set(list1)) + len(set(list2))) - intersection` – Amir Feb 01 '21 at 08:40
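
The 0.5 and 0.25 figures quoted in the comments can be reproduced with a quick check. This sketch assumes the earlier revision of the answer used plain list lengths in the denominator, which is a guess based on the comments; `jaccard_list_lengths` is a hypothetical name for that variant. The set-based version shown in the answer gives 0.5 for both inputs.

```python
def jaccard_similarity(list1, list2):
    # Current version of the answer: set sizes in the denominator.
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection
    return intersection / union


def jaccard_list_lengths(list1, list2):
    # Assumed earlier variant: list lengths, so duplicates inflate the union.
    intersection = len(set(list1).intersection(list2))
    union = len(list1) + len(list2) - intersection
    return intersection / union


print(jaccard_similarity([1], [0, 1]))          # 0.5
print(jaccard_similarity([1, 1], [0, 1, 1]))    # 0.5
print(jaccard_list_lengths([1], [0, 1]))        # 0.5
print(jaccard_list_lengths([1, 1], [0, 1, 1]))  # 0.25
```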

score 30 · Answer 2 · edited Oct 27 '20 at 13:01

For Python 3:

def jaccard_similarity(list1, list2):
    # Convert to sets so repeated items (e.g. the second 'cat') count once
    s1 = set(list1)
    s2 = set(list2)
    return float(len(s1.intersection(s2)) / len(s1.union(s2)))

list1 = ['dog', 'cat', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
jaccard_similarity(list1, list2)
>>> 0.5

For Python 2, use `return len(s1.intersection(s2)) / float(len(s1.union(s2)))`

edited Oct 27 '20 at 13:01 by High Performance Rangsiman

answered Aug 28 '18 at 13:34 by w4bo

This will also give 0.0 as result. Return statement should be modified : return float(len(s1.intersection(s2))) / float(len(s1.union(s2))) – Shalini Baranwal May 13 '19 at 09:35
For Python2 use: `return float(len(s1.intersection(s2))) / len(s1.union(s2))` – seralouk Jul 31 '19 at 10:00
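
The 0.0 results reported in the comments are consistent with Python 2, where `/` between two integers floors the quotient before any `float()` wrapper is applied. A small illustration, using Python 3's `//` to mimic that behaviour; the counts 2 and 4 are the intersection and union sizes from the dog/cat example above, and the variable names are made up for this sketch.

```python
intersection_size = 2  # {'dog', 'cat'}
union_size = 4         # {'dog', 'cat', 'rat', 'mouse'}

# Floor division first, then float(): what Python 2's int / int amounts to.
floored = float(intersection_size // union_size)

# True division, the default for / in Python 3.
true_div = intersection_size / union_size

# Casting one operand first gives the right answer on Python 2 and 3 alike.
cast_first = float(intersection_size) / union_size

print(floored, true_div, cast_first)  # 0.0 0.5 0.5
```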

score 14 · Answer 3 · answered Jun 13 '18 at 18:02

@aventinus I don't have enough reputation to comment on your answer, but to make things clearer: your solution measures the jaccard_similarity, yet the function is misnamed as jaccard_distance. The Jaccard distance is actually 1 - jaccard_similarity.
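
To make the relationship concrete, here is a minimal sketch of both measures side by side, assuming Python 3 and non-empty inputs. The example lists are chosen for illustration so that the two values differ.

```python
def jaccard_similarity(list1, list2):
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)


def jaccard_distance(list1, list2):
    # Distance is the complement of similarity.
    return 1 - jaccard_similarity(list1, list2)


list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'mouse']

# 1 shared word out of 4 distinct words
print(jaccard_similarity(list1, list2))  # 0.25
print(jaccard_distance(list1, list2))    # 0.75
```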

answered Jun 13 '18 at 18:02 by iamlcc

Thank you for the tip! I did not know that. I edited the answer accordingly. – Aventinus Jun 13 '18 at 21:45

score 7 · Answer 4 · answered Oct 27 '17 at 13:43

Assuming your usernames don't repeat, you can use the same idea:

def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
# The intersection is ['dog', 'cat']
# union is ['dog', 'cat', 'rat', 'mouse']
words1 = set(list1)
words2 = set(list2)
jaccard(words1, words2)
>>> 0.5
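
The denominator in `jaccard` relies on inclusion-exclusion: for sets, |A ∪ B| = |A| + |B| - |A ∩ B|. A quick check of that identity on the same words, as a sketch; it also shows why the function expects sets, since the identity breaks when lengths are taken from a list with repeats.

```python
words1 = {'dog', 'cat', 'rat'}
words2 = {'dog', 'cat', 'mouse'}
common = words1 & words2

# For sets, the shortcut matches the size of the actual union.
shortcut = len(words1) + len(words2) - len(common)
print(shortcut, len(words1 | words2))  # 4 4

# With a list containing a repeat, the shortcut overcounts the union.
list_with_repeat = ['dog', 'cat', 'cat', 'rat']
overcount = len(list_with_repeat) + len(words2) - len(common)
print(overcount)  # 5
```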

score 3 · Answer 5 · answered Dec 14 '18 at 15:13

You can use the Distance library

#pip install Distance

import distance

distance.jaccard("decide", "resize")

# Returns the Jaccard distance (1 - similarity) over the characters of each string
0.7142857142857143

answered Dec 14 '18 at 15:13 by LaSul

This answer describes how to get the Jaccard similarity between two strings which is not what this question is about. – Aventinus Sep 28 '22 at 08:25

Erwin Scholtens · Answer 6 · 2019-06-26T14:05:32.533

@Aventinus (I also cannot comment): Note that Jaccard similarity is an operation on sets, so in the denominator part it should also use sets (instead of lists). So for example jaccard_similarity('aa', 'ab') should result in 0.5.

def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection

    return intersection / union

Note that in the intersection, there is no need to cast to list first. Also, the cast to float is not needed in Python 3.

Brian Risk · Answer 7 · 2022-11-12T16:17:56.787

Creator of the Simphile NLP text similarity package here. Simphile contains several text similarity methods, Jaccard being one of them.

In the terminal install the package:

pip install simphile

Then your code could be something like:

from simphile import jaccard_list_similarity

list_a = ['cat', 'cat', 'dog']
list_b = ['dog', 'dog', 'cat']

print(f"Jaccard Similarity: {jaccard_list_similarity(list_a, list_b)}")

The output being:

Jaccard Similarity: 0.5

Note that this solution accounts for repeated elements -- critical for text similarity; without it, the above example would show 100% similarity due to the fact that both lists as sets would reduce to {'dog', 'cat'}.
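
The point about sets collapsing repeats can be checked with plain Python, without the package. This sketch only demonstrates the set-based result for the same two lists; it does not reproduce Simphile's implementation.

```python
list_a = ['cat', 'cat', 'dog']
list_b = ['dog', 'dog', 'cat']

set_a, set_b = set(list_a), set(list_b)

# Both lists reduce to the same set, so the set-based score is 1.0.
sets_equal = set_a == set_b
set_based_score = len(set_a & set_b) / len(set_a | set_b)

print(sets_equal)       # True
print(set_based_score)  # 1.0
```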

kd88 · Answer 8 · 2019-04-26T15:07:05.137

If you'd like to include repeated elements, you can use Counter, which I would imagine is relatively quick since it's just an extended dict under the hood:

from collections import Counter
def jaccard_repeats(a, b):
    """Jaccard similarity measure between input iterables,
    allowing repeated elements"""
    _a = Counter(a)
    _b = Counter(b)
    c = (_a - _b) + (_b - _a)
    n = sum(c.values())
    return n/(len(a) + len(b) - n)

list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']     

jaccard_repeats(list1, list3)      
>>> 0.75

jaccard_repeats(list1, list2) 
>>> 0.16666666666666666

jaccard_repeats(list2, list3)  
>>> 0.5

edited Apr 26 '19 at 15:07

answered Dec 14 '18 at 14:53 by kd88

I think this solution is not correct as regards repeated items. However, it works ok for lists with non-repeated items. – AlessioX Feb 20 '19 at 07:37
I think that this is distance, so if one want similarity, '1 - ' should be removed from return line. – Tedo Vrbanec Apr 26 '19 at 12:49
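
As the comments suggest, `jaccard_repeats` behaves like a dissimilarity: it returns 0 when both inputs are identical. For comparison, here is a sketch of one common multiset (weighted) Jaccard similarity, which takes the minimum count of each item for the intersection and the maximum count for the union. The name `jaccard_multiset` and the choice to return 1.0 for two empty inputs are assumptions made for this example; it is an alternative definition, not a patch of the answer's code.

```python
from collections import Counter


def jaccard_multiset(a, b):
    """Multiset Jaccard similarity: sum of min counts / sum of max counts."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # per-item minimum
    union = sum((count_a | count_b).values())         # per-item maximum
    if union == 0:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    return intersection / union


list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']

print(jaccard_multiset(list1, list3))  # 2 / 5 = 0.4
print(jaccard_multiset(list1, list2))  # 3 / 4 = 0.75
print(jaccard_multiset(list2, list3))  # 2 / 4 = 0.5
```

With this definition, identical lists score 1.0 and lists with no items in common score 0.0.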

score 1 · Answer 9 · answered Apr 26 '21 at 15:05

To avoid counting repeated elements in the union (the denominator), and to run a little faster, I propose:

def Jaccar_score(lista_1, lista_2):
    # Parameter names now match the names used in the body
    inter = len(list(set(lista_1) & set(lista_2)))
    union = len(list(set(lista_1) | set(lista_2)))
    return inter/union
