
I have a dictionary that looks like this:

my_dict = {'Community A': ['User 1', 'User 2', 'User 3'],
           'Community B': ['User 1', 'User 2'],
           'Community C': ['User 3', 'User 4', 'User 5'],
           'Community D': ['User 1', 'User 3', 'User 4', 'User 5']}

My goal is to model the networked relations between the different communities and their sets of unique users to see which communities are most similar. Currently, I am exploring Jaccard similarity.

I have come across answers that do similar operations, but only on exactly 2 dictionaries; in my case, I have several, and will need to calculate the similarities between each set.

Also, some of the lists have different lengths. In other answers, I saw 0 substituted for the missing values in that case, which I think would work for me as well.

bad_coder
n0ro
  • Jaccard is normally pairwise... some ideas here perhaps https://stackoverflow.com/questions/2035326/computing-degree-of-similarity-among-a-group-of-sets –  Nov 04 '19 at 15:02
  • Does this answer your question? [How to compute jaccard similarity from a pandas dataframe](https://stackoverflow.com/questions/37003272/how-to-compute-jaccard-similarity-from-a-pandas-dataframe) – Jab Nov 04 '19 at 15:09

1 Answer


What you need is the matrix of pairwise Jaccard similarities. If you prefer, you can store it as a dict indexed by a tuple (groupA, groupB).
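For reference, this is the standard definition of the Jaccard similarity of two sets, with Community A and Community B from the question worked through as an example:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
J(\text{Community A}, \text{Community B})
  = \frac{|\{\text{User 1}, \text{User 2}\}|}{|\{\text{User 1}, \text{User 2}, \text{User 3}\}|}
  = \frac{2}{3} \approx 0.667
```

Because the measure is defined on sets, lists of different lengths need no padding.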

A simple implementation follows:

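def jaccard(first, second):
    # Jaccard similarity: size of the intersection divided by size of the union
    return len(set(first).intersection(second)) / len(set(first).union(second))

keys = list(my_dict.keys())
result_dict = {}

# Fill in every ordered pair (k, l), including the diagonal
for k in keys:
    for l in keys:
        # Reuse the value of the mirrored pair (l, k) if it is already stored
        result_dict[(k, l)] = result_dict.get((l, k), jaccard(my_dict[k], my_dict[l]))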

Then the resulting dict looks like

print(result_dict)
{('Community A', 'Community A'): 1.0, ('Community A', 'Community B'): 0.6666666666666666, ('Community A', 'Community C'): 0.2, ('Community A', 'Community D'): 0.4, ('Community B', 'Community A'): 0.6666666666666666, ('Community B', 'Community B'): 1.0, ('Community B', 'Community C'): 0.0, ('Community B', 'Community D'): 0.2, ('Community C', 'Community A'): 0.2, ('Community C', 'Community B'): 0.0, ('Community C', 'Community C'): 1.0, ('Community C', 'Community D'): 0.75, ('Community D', 'Community A'): 0.4, ('Community D', 'Community B'): 0.2, ('Community D', 'Community C'): 0.75, ('Community D', 'Community D'): 1.0}

As expected, the diagonal elements (each community compared with itself) are all 1.0.

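Explanation: the get call reuses the value stored under the mirrored key (l, k) if that pair has already been computed; otherwise it falls back to the freshly calculated value. Note that Python evaluates the default argument before calling get, so jaccard still runs for every pair; only the stored value is reused.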
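If the redundant calls matter, one possible variation is to visit each unordered pair only once with itertools.combinations and then sort the pairs to see which communities are most similar. This is only a sketch: the names pair_scores and ranked are made up for illustration, and returning 0.0 for two empty lists is an assumption rather than part of the approach above.

```python
from itertools import combinations

# Same data as in the question.
my_dict = {'Community A': ['User 1', 'User 2', 'User 3'],
           'Community B': ['User 1', 'User 2'],
           'Community C': ['User 3', 'User 4', 'User 5'],
           'Community D': ['User 1', 'User 3', 'User 4', 'User 5']}


def jaccard(first, second):
    """Size of the intersection divided by size of the union."""
    a, b = set(first), set(second)
    union = a | b
    # Assumption: two empty collections get similarity 0.0 instead of
    # raising ZeroDivisionError.
    return len(a & b) / len(union) if union else 0.0


# combinations() yields each unordered pair of distinct keys exactly once,
# so jaccard() runs 6 times here instead of 16.
pair_scores = {
    (k, l): jaccard(my_dict[k], my_dict[l])
    for k, l in combinations(my_dict, 2)
}

# Sort the pairs from most to least similar.
ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)

for (k, l), score in ranked:
    print(f"{k} / {l}: {score:.2f}")
```

The diagonal and the mirrored pairs are left out here because they carry no extra information; if the full matrix is needed, they can be filled in afterwards from pair_scores.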

kosnik