Finding the most frequent occurrences of pairs in a list of lists

Question

I have a dataset that lists the authors of many technical reports. Each report can be authored by one or more people:

a = [
['John', 'Mark', 'Jennifer'],
['John'],
['Joe', 'Mark'],
['John', 'Anna', 'Jennifer'],
['Jennifer', 'John', 'Mark']
]

I need to find the most frequent pairs, that is, the people who have collaborated most often in the past:

['John', 'Jennifer'] - 3 times
['John', 'Mark'] - 2 times
['Mark', 'Jennifer'] - 2 times
etc...

How can I do this in Python?

Uh, I'd probably start with some sort of [Counter](https://docs.python.org/2/library/collections.html#collections.Counter) – NightShadeQueen, Jul 18 '15 at 20:57
Unfortunately nothing, I don't even know how to start here. The only approach I can think of is to construct a huge array, but I guess there are more efficient ways – paginated, Jul 18 '15 at 20:58
I would 1) parse the array, 1.1) for each list find all pairs, 1.2) put each pair in a dict as a key with value 0; 2) loop again over the array, 2.1) find all pairs, 2.2) for each pair increment the counter in the dict – Antonio Ragagnin, Jul 18 '15 at 21:01
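
The two-pass idea in the last comment could look roughly like the sketch below. It uses a plain dict and the sample list from the question; the names pair_counts and ranked are made up for illustration and do not come from the comment.

```python
from itertools import combinations

# Sample data from the question.
a = [
    ['John', 'Mark', 'Jennifer'],
    ['John'],
    ['Joe', 'Mark'],
    ['John', 'Anna', 'Jennifer'],
    ['Jennifer', 'John', 'Mark'],
]

# Pass 1: register every pair that occurs, starting at zero.
# sorted() gives each pair one canonical order.
pair_counts = {}
for authors in a:
    for pair in combinations(sorted(authors), 2):
        pair_counts[pair] = 0

# Pass 2: walk the data again and increment each pair's counter.
for authors in a:
    for pair in combinations(sorted(authors), 2):
        pair_counts[pair] += 1

# Rank the pairs by count, highest first.
ranked = sorted(pair_counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])
```

The first pass exists only to initialise the counters, which is the step a Counter or defaultdict(int) makes unnecessary.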

score 12 · Accepted Answer · edited Jul 18 '15 at 21:12

Use a collections.Counter dict with itertools.combinations:

from collections import Counter
from itertools import combinations

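d = Counter()
for sub in a:
    # Skip single-author reports: they contain no pairs.
    # Note that this checks the sublist, not the outer list a.
    if len(sub) < 2:
        continue
    # Sort a copy so ('John', 'Mark') and ('Mark', 'John') share one key
    # and the input list is left unchanged.
    for comb in combinations(sorted(sub), 2):
        d[comb] += 1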

print(d.most_common())
[(('Jennifer', 'John'), 3), (('John', 'Mark'), 2), (('Jennifer', 'Mark'), 2), (('Anna', 'John'), 1), (('Joe', 'Mark'), 1), (('Anna', 'Jennifer'), 1)]

most_common() returns the pairings ordered from most to least common. If you want only the n most common, pass n: d.most_common(n).
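
As a minimal sketch of passing n, assuming the sample list a from the question (the helper name pair_counts is hypothetical, introduced only for this example):

```python
from collections import Counter
from itertools import combinations

# Sample data from the question.
a = [
    ['John', 'Mark', 'Jennifer'],
    ['John'],
    ['Joe', 'Mark'],
    ['John', 'Anna', 'Jennifer'],
    ['Jennifer', 'John', 'Mark'],
]


def pair_counts(reports):
    """Count how often each unordered pair of authors shares a report."""
    counts = Counter()
    for authors in reports:
        # sorted() gives every pair one canonical order; set() guards
        # against a name that is listed twice in the same report.
        counts.update(combinations(sorted(set(authors)), 2))
    return counts


d = pair_counts(a)
top = d.most_common(1)  # only the single most frequent pair
print(top)  # [(('Jennifer', 'John'), 3)]
```

Counter.update() accepts an iterable and counts its elements, so the inner loop with d[comb] += 1 is not needed in this variant.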

edited Jul 18 '15 at 21:12

Mazdak

answered Jul 18 '15 at 21:02

Padraic Cunningham

Thanks. I also liked other solutions, but this one also sorts the data. – paginated Jul 18 '15 at 21:21
@paginated, no worries, a Counter dict is pretty much tailor-made for what you want – Padraic Cunningham Jul 18 '15 at 21:22
Shouldn't it be `if len(sub) < 2:`, not `if len(a) < 2:`? – Nick Dec 22 '20 at 05:46

score 4 · Answer 2 · answered Jul 18 '15 at 21:01

import collections
import itertools

a = [
['John', 'Mark', 'Jennifer'],
['John'],
['Joe', 'Mark'],
['John', 'Anna', 'Jennifer'],
['Jennifer', 'John', 'Mark']
]


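# defaultdict(int) starts every missing pair at 0.
counts = collections.defaultdict(int)
for collab in a:
    # Sort in place so each pair has one canonical order, e.g. ('John', 'Mark').
    collab.sort()
    for pair in itertools.combinations(collab, 2):
        counts[pair] += 1

# The dict is not ordered by frequency.
for pair, freq in counts.items():
    print(pair, freq)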

Output:

('John', 'Mark') 2
('Jennifer', 'Mark') 2
('Anna', 'John') 1
('Jennifer', 'John') 3
('Anna', 'Jennifer') 1
('Joe', 'Mark') 1
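
The output above is not ordered by frequency, which is what the question asks for. One possible way to rank the pairs is sketched below, assuming the same sample data; the name ranked and the alphabetical tie-break are choices made for this example, not part of the answer.

```python
import collections
import itertools

# Sample data from the question.
a = [
    ['John', 'Mark', 'Jennifer'],
    ['John'],
    ['Joe', 'Mark'],
    ['John', 'Anna', 'Jennifer'],
    ['Jennifer', 'John', 'Mark'],
]

counts = collections.defaultdict(int)
for collab in a:
    # sorted() returns a new list, so the input is not modified.
    for pair in itertools.combinations(sorted(collab), 2):
        counts[pair] += 1

# Highest frequency first; ties are broken alphabetically by pair
# so that the result is the same on every run.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for pair, freq in ranked:
    print(pair, freq)
```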

score 1 · Answer 3 · Mazdak · 2015-07-18T21:11:11.277

You can use a set comprehension to create a set of all names, then use a list comprehension to count how many sublists contain each pair of names:

>>> from itertools import combinations as comb
>>> all_nam={j for i in a for j in i}
>>> [[(i,j),sum({i,j}.issubset(t) for t in a)] for i,j in comb(all_nam,2)]

[[('Jennifer', 'John'), 3], 
 [('Jennifer', 'Joe'), 0], 
 [('Jennifer', 'Anna'), 1], 
 [('Jennifer', 'Mark'), 2], 
 [('John', 'Joe'), 0], 
 [('John', 'Anna'), 1], 
 [('John', 'Mark'), 2], 
 [('Joe', 'Anna'), 0], 
 [('Joe', 'Mark'), 1], 
 [('Anna', 'Mark'), 0]]
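
The result above also lists pairs that never worked together (count 0), and its order depends on how the set happens to iterate. A script version that drops the zero counts and ranks the rest might look like this sketch; the names all_names, pair_counts and collaborations are invented for the example.

```python
from itertools import combinations

# Sample data from the question.
a = [
    ['John', 'Mark', 'Jennifer'],
    ['John'],
    ['Joe', 'Mark'],
    ['John', 'Anna', 'Jennifer'],
    ['Jennifer', 'John', 'Mark'],
]

# sorted() makes the order of names, and so of pairs, deterministic.
all_names = sorted({name for report in a for name in report})

# For every possible pair, count the reports that contain both names.
pair_counts = [
    ((x, y), sum({x, y}.issubset(report) for report in a))
    for x, y in combinations(all_names, 2)
]

# Keep only pairs that actually collaborated, most frequent first.
collaborations = sorted(
    (item for item in pair_counts if item[1] > 0),
    key=lambda item: item[1],
    reverse=True,
)
print(collaborations[0])  # (('Jennifer', 'John'), 3)
```

Note that this approach tests every possible pair of names against every report, so it does more work than counting only the pairs that occur.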

edited Jul 18 '15 at 21:11

answered Jul 18 '15 at 21:02

Mazdak

`sum({i,j}.issubset(t))...` will do the same thing – Padraic Cunningham Jul 18 '15 at 21:09
