As pointed out in the comments, set
is designed to remove duplicates. What you want is not natively supported in Python (at least not in an efficient way), but there are simply ways to implement them.
Here I'll demonstrate two different implementations. My assumption here is that:
- You're only interested in performing this "set difference" operation, which is defined as "removing (a single copy of) an element from the left set for each element in the right set".
- The difference operation is performed after all insertion operations, and is only performed once.
Solution 1: Sorted Lists
It is possible to implement this naively using lists:
def difference_list(a, b):
a = list(a) # make a copy of `a`
for x in b: # iterate over elements in `b`
if x in a:
a.remove(x) # removes a single copy of `x` in `a`
return a
However, this is very inefficient. If you're familiar with complexity theory, the above algorithm has a time complexity of O(|a|*|b|)
, where |a|
means the length of list a
.
A more efficient approach is to sort both lists and then iterate through both arrays at the same time:
def difference_sorted(a, b):
a = sorted(a) # sort both lists
b = sorted(b)
result = []
pos = 0 # pointer into list `b`
for x in a:
while pos < len(b) and b[pos] < x:
pos += 1
if pos < len(b) and x == b[pos]:
pos += 1 # `x` exists in `b` at index `pos`
continue
result.append(x)
return result
Solution 2: Counters
The collections
package contains a Counter
type that counts the occurrences of each added element:
from collections import Counter
c = Counter(['a', 'b', 'c', 'a', 'a', 'b'])
print(c) # Counter({'a': 3, 'b': 2, 'c': 1})
So what you can do is to maintain a counter for each "set", and do the difference operation efficiently using the stored counts:
def difference_counter(a, b):
a = Counter(a) # construct counters for both lists
b = Counter(b) # you can also maintain counters directly while iterating over raw data
# A counter is essentially a dictionary, so the `dict` methods are supported.
for x, cnt in b.items(): # iterate over elements and counts in `b`
if x in a and a[x] <= cnt: # if count of `x` in `a` <= count of `x` in `b`
del a[x] # remove `x` from `a`
else:
a[x] -= cnt # subtract the count in `b` from the count in `a`
result = [] # convert the counter back into a list
for x, cnt in a.items():
result += [x] * cnt
return result
You can verify that the implementations give correct results:
a = [1, 1, 1, 1, 2, 2, 2, 3, 3]
b = [1, 2, 2, 3, 4, 5]
result = [1, 1, 1, 2, 3]
assert sorted(difference_list(a, b)) == sorted(result)
assert sorted(difference_sorted(a, b)) == sorted(result)
assert sorted(difference_counter(a, b)) == sorted(result)