20

Given 2 lists:

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

I want to find the "overlap":

c = [3,4,5,5,6]

I'd also like it if i could extract the "remainder" the part of a and b that's not in c.

a_remainder = [5,]
b_remainder = [1,4,7,]

Note: a has three 5's in it and b has two. b has two 4's in it and a has one.

The resultant list c should have two 5's (limited by list b) and one 4 (limited by list a).

This gives me what i want, but I can't help but think there's a much better way.

import copy

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

c = []
for elem in copy.deepcopy(a):
    if elem in b:
        a.pop(a.index(elem))
        c.append(b.pop(b.index(elem)))

# now a and b both contain the "remainders" and c contains the "overlap"

On another note, what is a more accurate name for what I'm asking for than "overlap" and "remainder"?

MERose
  • 4,048
  • 7
  • 53
  • 79
user625477
  • 3,086
  • 2
  • 15
  • 8

8 Answers8

23

collection.Counter available in Python 2.7 can be used to implement multisets that do exactly what you want.

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

a_multiset = collections.Counter(a)
b_multiset = collections.Counter(b)

overlap = list((a_multiset & b_multiset).elements())
a_remainder = list((a_multiset - b_multiset).elements())
b_remainder = list((b_multiset - a_multiset).elements())

print overlap, a_remainder, b_remainder
Sasha Chedygov
  • 127,549
  • 26
  • 102
  • 115
Rosh Oxymoron
  • 20,355
  • 6
  • 41
  • 43
16

Use python set

intersection = set(a) & set(b)
a_remainder = set(a) - set(b)
b_remainder = set(b) - set(a)
kefeizhou
  • 6,234
  • 10
  • 42
  • 55
  • 2
    This is a nice idea but the person's data is not a valid set and has dupes and considers that too. – eapen Feb 23 '11 at 17:51
6

In the language of sets, overlap is 'intersection' and remainder is 'set difference'. If you had distinct items, you wouldn't have to do these operations yourself, check out http://docs.python.org/library/sets.html if you're interested.

Since we're not working with distinct elements, your approach is reasonable. If you wanted this to run faster, you could create a dictionary for each list and map the number to how many elements are in each array (e.g., in a, 3->1, 4->1, 5->2, etc.). You would then iterate through map a, determine if that letter existed, decrement its count and add it to the new list

Untested code, but this is the idea

def add_or_update(map,value):
    if value in map:
        map[value]+=1
    else
        map[value]=1

b_dict = dict()
for b_elem in b:
    add_or_update(b_dict,b_elem)

intersect = []; diff = [];

for a_elem in a:
    if a_elem in b_dict and b_dict[a_elem]>0:
        intersect.add(a_elem);

for k,v in diff:
    for i in range(v):
        diff.add(k);
dfb
  • 13,133
  • 2
  • 31
  • 52
4

OK, verbose, but kind of cool (similar in spirit to the collections.Counter idea, but more home-made):

import itertools as it
flatten = it.chain.from_iterable 
sorted(
   v for u,v in 
    set(flatten(enumerate(g) 
        for k, g in it.groupby(a))).intersection(
    set(flatten(enumerate(g)
        for k, g in it.groupby(b))))
   )

The basic idea is to make each of the lists into a new list which attaches a counter to each object, numbered to account for duplicates -- so that then you can then use set operations on these tuples after all.

To be slightly less verbose:

 aa = set(flatten(enumerate(g) for k, g in it.groupby(a)))
 bb = set(flatten(enumerate(g) for k, g in it.groupby(b)))
 # aa = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5), (2, 5)])
 # bb = set([(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 4), (1, 5)])

 cc = aa.intersection(bb)
 # cc = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5)])
 c = sorted(v for u,v in cc)
 # c = [3, 4, 5, 5, 6]
  • groupby -- produces a list of lists containing identical elements (but because of the syntax needs the g for k,g in it.groupby(a) to extract each list)
  • enumerate -- appends a counter to each element of each sublist
  • flatten -- create a single list
  • set -- convert to a set
  • intersection -- find the common elements
  • sorted(v for u,v in cc) -- get rid of the counters and sort the result

Finally, I'm not sure what you mean by the remainders; it seems like it ought to be my aa-cc and bb-cc but I don't know where you get a_remainder = [4]:

sorted(v for u,v in aa-cc)
# [5]

sorted(v for u,v in bb-cc)
# [1, 4, 7]
Andrew Jaffe
  • 26,554
  • 4
  • 50
  • 59
2

A response from kerio in #python on freenode:

[ i for i in itertools.chain.from_iterable([k] * v for k, v in \
  (Counter(a) & Counter(b)).iteritems())
]
user625477
  • 3,086
  • 2
  • 15
  • 8
1

Try difflib.SequenceMatcher(), "a flexible class for comparing pairs of sequences of any type"...

A quick try:

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

sm = difflib.SequenceMatcher(None, a, b)
c = []
a_remainder = []
b_remainder = []

for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'replace':
        a_remainder.extend(a[i1:i2])
        b_remainder.extend(b[j1:j2])
    elif tag == 'delete':
        a_remainder.extend(a[i1:i2])
    elif tag == 'insert':
        b_remainder.extend(b[j1:j2])
    elif tag == 'equal':
        c.extend(a[i1:i2])

And now...

>>> print c
[3, 4, 5, 5, 6]
>>> print a_remainder
[5]
>>> print b_remainder
[1, 4, 7]
Steven
  • 28,002
  • 5
  • 61
  • 51
0
Aset = Set(a);
Bset = Set(b);
a_remainder = a.difference(b);
b_remainder = b.difference(a);
c = a.intersection(b);

But if you need c to have duplicates, and order is important for you,
you may look for w:Longest common subsequence problem

kirilloid
  • 14,011
  • 6
  • 38
  • 52
-1

I don't think you should actually use this solution, but I took this opportunity to practice with lambda functions and here is what I came up with :)

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
dedup = lambda x: [set(x)] if len(set(x)) == len(x) else [set(x)] + dedup([x[i] for i in range(1, len(x)) if x[i] == x[i-1]])
default_set = lambda x: (set() if x[0] is None else x[0], set() if x[1] is None else x[1])
deduped = map(default_set, map(None, dedup(a), dedup(b)))
get_result = lambda f: reduce(lambda x, y: list(x) + list(y), map(lambda x: f(x[0], x[1]), deduped))
c = get_result(lambda x, y: x.intersection(y))          # [3, 4, 5, 6, 5]
a_remainder = get_result(lambda x, y: x.difference(y))  # [5]
b_remainder = get_result(lambda x, y: y.difference(x))  # [1, 7, 4]

I'm pretty sure izip_longest would have simplified this a bit (wouldn't have needed the default_set lambda), but I was testing this with Python 2.5.

Here are some of the intermediate values used in the calculation in case anyone wants to understand this:

dedup(a) = [set([3, 4, 5, 6]), set([5]), set([5])]
dedup(b) = [set([1, 3, 4, 5, 6, 7]), set([4, 5])]
deduped = [(set([3, 4, 5, 6]), set([1, 3, 4, 5, 6, 7])), (set([5]), set([4, 5])), (set([5]), set([]))]
Andrew Clark
  • 202,379
  • 35
  • 273
  • 306
  • Out of curiosity, why the downvote? Unlike other downvoted answers mine gets the correct solution. I know it is hard to read but that is why the disclaimer is there. I only posted this because I thought others would be interested in a solution that uses set operations, even if it is a bit obscure. – Andrew Clark Feb 23 '11 at 18:25
  • 1
    A solution is only correct if you can independently verify it. Otherwise you're relying on the author and the few examples that you can think of to test it. Your solution is an excellent mind exercise for the reader, but you don't want the code verification phase to be longer than the code writing phase. ;) – Rosh Oxymoron Feb 23 '11 at 18:56