Computing Jaccard similarity on multiple dictionaries in Python?

Question

I have a dictionary that looks like this:

my_dict = {'Community A': ['User 1', 'User 2', 'User 3'],
           'Community B': ['User 1', 'User 2'],
           'Community C': ['User 3', 'User 4', 'User 5'],
           'Community D': ['User 1', 'User 3', 'User 4', 'User 5']}

My goal is to model the networked relations between the different communities and their sets of unique users to see which communities are most similar. Currently, I am exploring Jaccard similarity for this.

I have come across answers that do similar operations, but only on exactly 2 dictionaries; in my case, I have several, and will need to calculate the similarities between each set.

Also, some of the lists have different lengths. In other answers I saw 0 substituted for missing values in that case, which I think would work for me.

Jaccard is normally pairwise... some ideas here perhaps https://stackoverflow.com/questions/2035326/computing-degree-of-similarity-among-a-group-of-sets — , Nov 04 '19 at 15:02
Does this answer your question? [How to compute jaccard similarity from a pandas dataframe](https://stackoverflow.com/questions/37003272/how-to-compute-jaccard-similarity-from-a-pandas-dataframe) — Jab, Nov 04 '19 at 15:09

score 2 · Accepted Answer · answered Nov 04 '19 at 15:09

What you need is the matrix of pairwise Jaccard similarities. If you prefer, you can store it as a dict indexed by a tuple (groupA, groupB).

A simple implementation follows:

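def jaccard(first, second):
    # Jaccard similarity: |A ∩ B| / |A ∪ B|, computed on the unique elements.
    a, b = set(first), set(second)
    return len(a & b) / len(a | b)

keys = list(my_dict.keys())
result_dict = {}

for k in keys:
    for l in keys:
        # Reuse the value already stored for the mirrored pair (l, k), if any.
        # The default passed to get() is evaluated eagerly, so jaccard() still runs for every pair.
        result_dict[(k, l)] = result_dict.get((l, k), jaccard(my_dict[k], my_dict[l]))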

Then the resulting dict looks like

print(result_dict)
{('Community A', 'Community A'): 1.0, ('Community A', 'Community B'): 0.6666666666666666, ('Community A', 'Community C'): 0.2, ('Community A', 'Community D'): 0.4, ('Community B', 'Community A'): 0.6666666666666666, ('Community B', 'Community B'): 1.0, ('Community B', 'Community C'): 0.0, ('Community B', 'Community D'): 0.2, ('Community C', 'Community A'): 0.2, ('Community C', 'Community B'): 0.0, ('Community C', 'Community C'): 1.0, ('Community C', 'Community D'): 0.75, ('Community D', 'Community A'): 0.4, ('Community D', 'Community B'): 0.2, ('Community D', 'Community C'): 0.75, ('Community D', 'Community D'): 1.0}

The diagonal elements are all 1.0, since every community is identical to itself.
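Since the goal in the question is to see which communities are most similar, here is a minimal sketch of ranking the pairs from such a dict. The names `unique_pairs` and `ranked_pairs` are my own, not from the answer, and the dict is rebuilt here only so the snippet runs on its own.

```python
my_dict = {'Community A': ['User 1', 'User 2', 'User 3'],
           'Community B': ['User 1', 'User 2'],
           'Community C': ['User 3', 'User 4', 'User 5'],
           'Community D': ['User 1', 'User 3', 'User 4', 'User 5']}

def jaccard(first, second):
    return len(set(first).intersection(second)) / len(set(first).union(second))

# Full pairwise dict, keyed by (groupA, groupB) as in the answer.
result_dict = {(k, l): jaccard(my_dict[k], my_dict[l])
               for k in my_dict for l in my_dict}

# Keep each unordered pair once; comparing the keys also drops the diagonal.
unique_pairs = {pair: score for pair, score in result_dict.items()
                if pair[0] < pair[1]}

# Sort from most to least similar.
ranked_pairs = sorted(unique_pairs.items(), key=lambda item: item[1], reverse=True)

for (k, l), score in ranked_pairs:
    print(f"{k} / {l}: {score:.2f}")
```

With the sample data, Community C and Community D come out on top (0.75), and Community B and Community C at the bottom (0.0).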

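Explanation: the get call returns the value already stored for the mirrored pair (l, k) when there is one. Note that Python evaluates the default argument before calling get, so jaccard still runs for every pair; only the stored value is reused.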
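If you actually want to skip the duplicate work, one option is to compute each unordered pair once and mirror it. This is only a sketch: the name `pairwise` is hypothetical, and returning 0.0 for two empty groups is my assumption, since the ratio is undefined there. It also shows that lists of different lengths need no padding, because the comparison is done on sets.

```python
from itertools import combinations_with_replacement

my_dict = {'Community A': ['User 1', 'User 2', 'User 3'],
           'Community B': ['User 1', 'User 2'],
           'Community C': ['User 3', 'User 4', 'User 5'],
           'Community D': ['User 1', 'User 3', 'User 4', 'User 5']}

def jaccard(first, second):
    a, b = set(first), set(second)
    union = a | b
    # Assumption: two empty groups get similarity 0.0 instead of raising ZeroDivisionError.
    return len(a & b) / len(union) if union else 0.0

pairwise = {}
# Yields (A, A), (A, B), ..., (D, D): every unordered pair once, diagonal included.
for k, l in combinations_with_replacement(my_dict, 2):
    score = jaccard(my_dict[k], my_dict[l])
    pairwise[(k, l)] = score
    pairwise[(l, k)] = score  # Jaccard is symmetric, so mirror the value

print(pairwise[('Community C', 'Community D')])
```

For n communities this calls jaccard n(n+1)/2 times instead of n², and the resulting dict has the same keys and values as result_dict above.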
