find the "overlap" between 2 python lists

Question

Given 2 lists:

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

I want to find the "overlap":

c = [3,4,5,5,6]

I'd also like it if i could extract the "remainder" the part of a and b that's not in c.

a_remainder = [5,]
b_remainder = [1,4,7,]

Note: a has three 5's in it and b has two. b has two 4's in it and a has one.

The resultant list c should have two 5's (limited by list b) and one 4 (limited by list a).

This gives me what i want, but I can't help but think there's a much better way.

import copy

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

c = []
for elem in copy.deepcopy(a):
    if elem in b:
        a.pop(a.index(elem))
        c.append(b.pop(b.index(elem)))

# now a and b both contain the "remainders" and c contains the "overlap"

On another note, what is a more accurate name for what I'm asking for than "overlap" and "remainder"?

In set theory, "overlap" is called the intersection, and "remainder" is called the difference. Unfortunately, you can't do this stuff in a single line with Python sets because you have repeated values. — nmichaels, Feb 23 '11 at 16:58
Andrew Jaffe: You're correct. I will adjust the question to avoid confusion in case any other person comes along. — user625477, Feb 23 '11 at 21:24
nmichaels and Rosh Oxymoron: Thank you. I was thinking that was dumb, but wasn't sure how to approach it. — user625477, Feb 23 '11 at 21:24

score 23 · Accepted Answer · edited Sep 26 '12 at 04:38

collection.Counter available in Python 2.7 can be used to implement multisets that do exactly what you want.

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

a_multiset = collections.Counter(a)
b_multiset = collections.Counter(b)

overlap = list((a_multiset & b_multiset).elements())
a_remainder = list((a_multiset - b_multiset).elements())
b_remainder = list((b_multiset - a_multiset).elements())

print overlap, a_remainder, b_remainder

score 16 · Answer 2 · answered Feb 23 '11 at 17:04

16

Use python set

intersection = set(a) & set(b)
a_remainder = set(a) - set(b)
b_remainder = set(b) - set(a)

answered Feb 23 '11 at 17:04

kefeizhou

6,234
10
42
55

2

This is a nice idea but the person's data is not a valid set and has dupes and considers that too. – eapen Feb 23 '11 at 17:51

dfb · Answer 3 · 2011-02-23T17:05:56.500

In the language of sets, overlap is 'intersection' and remainder is 'set difference'. If you had distinct items, you wouldn't have to do these operations yourself, check out http://docs.python.org/library/sets.html if you're interested.

Since we're not working with distinct elements, your approach is reasonable. If you wanted this to run faster, you could create a dictionary for each list and map the number to how many elements are in each array (e.g., in a, 3->1, 4->1, 5->2, etc.). You would then iterate through map a, determine if that letter existed, decrement its count and add it to the new list

Untested code, but this is the idea

def add_or_update(map,value):
    if value in map:
        map[value]+=1
    else
        map[value]=1

b_dict = dict()
for b_elem in b:
    add_or_update(b_dict,b_elem)

intersect = []; diff = [];

for a_elem in a:
    if a_elem in b_dict and b_dict[a_elem]>0:
        intersect.add(a_elem);

for k,v in diff:
    for i in range(v):
        diff.add(k);

Andrew Jaffe · Answer 4 · 2011-02-23T18:57:54.073

OK, verbose, but kind of cool (similar in spirit to the collections.Counter idea, but more home-made):

import itertools as it
flatten = it.chain.from_iterable 
sorted(
   v for u,v in 
    set(flatten(enumerate(g) 
        for k, g in it.groupby(a))).intersection(
    set(flatten(enumerate(g)
        for k, g in it.groupby(b))))
   )

The basic idea is to make each of the lists into a new list which attaches a counter to each object, numbered to account for duplicates -- so that then you can then use set operations on these tuples after all.

To be slightly less verbose:

 aa = set(flatten(enumerate(g) for k, g in it.groupby(a)))
 bb = set(flatten(enumerate(g) for k, g in it.groupby(b)))
 # aa = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5), (2, 5)])
 # bb = set([(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 4), (1, 5)])

 cc = aa.intersection(bb)
 # cc = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5)])
 c = sorted(v for u,v in cc)
 # c = [3, 4, 5, 5, 6]

groupby -- produces a list of lists containing identical elements (but because of the syntax needs the g for k,g in it.groupby(a) to extract each list)
enumerate -- appends a counter to each element of each sublist
flatten -- create a single list
set -- convert to a set
intersection -- find the common elements
sorted(v for u,v in cc) -- get rid of the counters and sort the result

Finally, I'm not sure what you mean by the remainders; it seems like it ought to be my aa-cc and bb-cc but I don't know where you get a_remainder = [4]:

sorted(v for u,v in aa-cc)
# [5]

sorted(v for u,v in bb-cc)
# [1, 4, 7]

user625477 · Answer 5 · 2011-02-23T18:15:00.647

2

A response from kerio in #python on freenode:

[ i for i in itertools.chain.from_iterable([k] * v for k, v in \
  (Counter(a) & Counter(b)).iteritems())
]

edited Feb 23 '11 at 18:15

answered Feb 23 '11 at 17:34

user625477

3,086
2
15
8

score 1 · Answer 6 · answered Feb 24 '11 at 09:54

Try difflib.SequenceMatcher(), "a flexible class for comparing pairs of sequences of any type"...

A quick try:

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

sm = difflib.SequenceMatcher(None, a, b)
c = []
a_remainder = []
b_remainder = []

for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'replace':
        a_remainder.extend(a[i1:i2])
        b_remainder.extend(b[j1:j2])
    elif tag == 'delete':
        a_remainder.extend(a[i1:i2])
    elif tag == 'insert':
        b_remainder.extend(b[j1:j2])
    elif tag == 'equal':
        c.extend(a[i1:i2])

And now...

>>> print c
[3, 4, 5, 5, 6]
>>> print a_remainder
[5]
>>> print b_remainder
[1, 4, 7]

score 0 · Answer 7 · answered Feb 23 '11 at 17:03

0

Aset = Set(a);
Bset = Set(b);
a_remainder = a.difference(b);
b_remainder = b.difference(a);
c = a.intersection(b);

But if you need c to have duplicates, and order is important for you,
you may look for w:Longest common subsequence problem

answered Feb 23 '11 at 17:03

kirilloid

14,011
6
38
52

I dont know why this got downvoted. This worked for me. – KarthikS Apr 22 '20 at 23:58

Andrew Clark · Answer 8 · 2011-02-23T18:14:46.827

I don't think you should actually use this solution, but I took this opportunity to practice with lambda functions and here is what I came up with :)

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
dedup = lambda x: [set(x)] if len(set(x)) == len(x) else [set(x)] + dedup([x[i] for i in range(1, len(x)) if x[i] == x[i-1]])
default_set = lambda x: (set() if x[0] is None else x[0], set() if x[1] is None else x[1])
deduped = map(default_set, map(None, dedup(a), dedup(b)))
get_result = lambda f: reduce(lambda x, y: list(x) + list(y), map(lambda x: f(x[0], x[1]), deduped))
c = get_result(lambda x, y: x.intersection(y))          # [3, 4, 5, 6, 5]
a_remainder = get_result(lambda x, y: x.difference(y))  # [5]
b_remainder = get_result(lambda x, y: y.difference(x))  # [1, 7, 4]

I'm pretty sure izip_longest would have simplified this a bit (wouldn't have needed the default_set lambda), but I was testing this with Python 2.5.

Here are some of the intermediate values used in the calculation in case anyone wants to understand this:

dedup(a) = [set([3, 4, 5, 6]), set([5]), set([5])]
dedup(b) = [set([1, 3, 4, 5, 6, 7]), set([4, 5])]
deduped = [(set([3, 4, 5, 6]), set([1, 3, 4, 5, 6, 7])), (set([5]), set([4, 5])), (set([5]), set([]))]

Out of curiosity, why the downvote? Unlike other downvoted answers mine gets the correct solution. I know it is hard to read but that is why the disclaimer is there. I only posted this because I thought others would be interested in a solution that uses set operations, even if it is a bit obscure. — Andrew Clark, Feb 23 '11 at 18:25
A solution is only correct if you can independently verify it. Otherwise you're relying on the author and the few examples that you can think of to test it. Your solution is an excellent mind exercise for the reader, but you don't want the code verification phase to be longer than the code writing phase. ;) — Rosh Oxymoron, Feb 23 '11 at 18:56

find the "overlap" between 2 python lists

8 Answers8

Linked