I have a dataset that is a list of lists of IDs. Each list can have one or more unique IDs.
a = [
['123','456','789'],
['123'],
['123','456'],
['456','789']
]
I want to calculate the following value for each pair of IDs: Value(id1,id2) = # of lists with both id1,id2 / # of lists with either id1 or id2 The expected result for the example list a above is as follows:
Value('123','456') = 0.5
Value('123','789') = 0.25
Value('456','789') = 0.66
Edit: I should have been more specific from the get-go, I would actually like to get this result in matrix form based on a pre-specified ordering of IDs (for instance '123','456','789'):
Value = array([[1, 0.5, 0.25],
[0.5, 1, 0.66],
[0.25, 0.66, 1]])
The matrix will be symmetric.
The approach I've tried is making two dictionaries of
- counts for each pair of IDs as described by the top answer in this post: Finding the most frequent occurrences of pairs in a list of lists.
- counts for each individual ID
from collections import Counter
from itertools import combinations
d = Counter()
s = Counter()
for lst in a:
for id in lst:
s[id] += 1
if len(lst) < 2:
continue
lst.sort()
for comb in combinations(lst,2):
d[comb] += 1
I then created a final dictionary that applies the Value function above to each pair of IDs using the two dictionaries s and d in a for loop.
finaldict={k: d[k]/(s[k[0]]+s[k[1]]-d[k]) for k in d.keys()}
This part is taking a very long time as the number of unique IDs is very large (approx 100,000) and I wonder if there is a quicker way to do this.
Edit: Assuming I get this finaldict created, I would then have to find a way to change it into the desired matrix form.