1

I have a dataset that is a list of lists of IDs. Each list can have one or more unique IDs.

a = [
['123','456','789'],
['123'],
['123','456'],
['456','789']
]

I want to calculate the following value for each pair of IDs: Value(id1,id2) = # of lists with both id1,id2 / # of lists with either id1 or id2 The expected result for the example list a above is as follows:

Value('123','456') = 0.5
Value('123','789') = 0.25
Value('456','789') = 0.66

Edit: I should have been more specific from the get-go, I would actually like to get this result in matrix form based on a pre-specified ordering of IDs (for instance '123','456','789'):

Value = array([[1, 0.5, 0.25],
               [0.5, 1, 0.66],
               [0.25, 0.66, 1]])

The matrix will be symmetric.

The approach I've tried is making two dictionaries of

  1. counts for each pair of IDs as described by the top answer in this post: Finding the most frequent occurrences of pairs in a list of lists.
  2. counts for each individual ID
from collections import Counter
from itertools import combinations
d = Counter()
s = Counter()
for lst in a:
    for id in lst:
        s[id] += 1
    if len(lst) < 2:
        continue
    lst.sort()
    for comb in combinations(lst,2):
        d[comb] += 1

I then created a final dictionary that applies the Value function above to each pair of IDs using the two dictionaries s and d in a for loop.

finaldict={k: d[k]/(s[k[0]]+s[k[1]]-d[k]) for k in d.keys()}

This part is taking a very long time as the number of unique IDs is very large (approx 100,000) and I wonder if there is a quicker way to do this.

Edit: Assuming I get this finaldict created, I would then have to find a way to change it into the desired matrix form.

Slug
  • 13
  • 3
  • Show the code that's taking a long time. – Barmar Aug 18 '23 at 17:00
  • @Barmar I've added the final dictionary code. Please also see my edited comments. – Slug Aug 18 '23 at 17:38
  • Your output (the matrix) will have 10^10 entries, that's going to take long however you do it, although likely exhaust your memory first and crash. You likely need to change your goal. – Kelly Bundy Aug 19 '23 at 11:38

1 Answers1

0

You can use set operations for the task:

from itertools import combinations

a = [["123", "456", "789"], ["123"], ["123", "456"], ["456", "789"]]

d = {}
for i, l in enumerate(a):
    for v in l:
        d.setdefault(v, set()).add(i)

for n1, n2 in combinations(d, 2):
    i = d[n1].intersection(d[n2])
    u = d[n1].union(d[n2])

    print(n1, n2, len(i) / len(u))

Prints:

123 456 0.5
123 789 0.25
456 789 0.6666666666666666

EDIT: To create a symmetrical DataFrame:

from itertools import combinations

from pandas import DataFrame

a = [["123", "456", "789"], ["123"], ["123", "456"], ["456", "789"]]

d = {}
for i, l in enumerate(a):
    for v in l:
        d.setdefault(v, set()).add(i)

out = []
for n1, n2 in combinations(d, 2):
    i = d[n1].intersection(d[n2])
    u = d[n1].union(d[n2])
    out.append((n1, n2, len(i) / len(u)))

df = pd.DataFrame()
for k1, k2, v in out:
    df.loc[k1, k2] = v
    df.loc[k2, k1] = v

df = df.fillna(1).sort_index(axis=0).sort_index(axis=1)
print(df)

Prints:

      123       456       789
123  1.00  0.500000  0.250000
456  0.50  1.000000  0.666667
789  0.25  0.666667  1.000000
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thank you for this, it seems to be working quicker than my approach. I wasn't specific enough in my original post. Is there a way to get the result in symmetric matrix form as I describe in my edited post? – Slug Aug 18 '23 at 17:52
  • @Slug I've updated my answer. – Andrej Kesely Aug 18 '23 at 18:05