0

I asked how to create a list with only the new entries comparing two files here: Create a list of new urls contained in objects in python.

The problem is that I want to keep duplicates and it doesn't seem possible with set() (my real list has 35 000 values). How can I change it to keep duplicates ?

    #last file
ccv_last = []
for item_last in last_data:
    results_last = item_last["results"]
    if results_last:
        for result_last in results_last:
            ccv_last.append(result_last["img_url"])
#second file
ccv_second = []
for item_second in second_data:
    results_second = item_second["results"]
    if results_second:
        for result_second in results_second:
            ccv_second.append(result_second["img_url"])

diff_list = list(set(ccv_last)-set(ccv_second)))

The result even if I have duplicates:

['https://img.com/30.jpg', 'https://img.com/3.jpg']
lf_celine
  • 653
  • 7
  • 19
  • 1
    Why are you using a `set` then? A set, by definition, can't contain duplicates. A `set` "modified to contain duplicates" is just a `list`. – DeepSpace May 04 '20 at 15:39
  • Because I didn't know that when I used it and I find out when I was looking for a solution to my problem – lf_celine May 04 '20 at 15:40

2 Answers2

1

As pointed out in the comments, set is designed to remove duplicates. What you want is not natively supported in Python (at least not in an efficient way), but there are simply ways to implement them.

Here I'll demonstrate two different implementations. My assumption here is that:

  • You're only interested in performing this "set difference" operation, which is defined as "removing (a single copy of) an element from the left set for each element in the right set".
  • The difference operation is performed after all insertion operations, and is only performed once.

Solution 1: Sorted Lists

It is possible to implement this naively using lists:

def difference_list(a, b):
    a = list(a)  # make a copy of `a`
    for x in b:  # iterate over elements in `b`
        if x in a:
            a.remove(x)  # removes a single copy of `x` in `a`
    return a

However, this is very inefficient. If you're familiar with complexity theory, the above algorithm has a time complexity of O(|a|*|b|), where |a| means the length of list a.

A more efficient approach is to sort both lists and then iterate through both arrays at the same time:

def difference_sorted(a, b):
    a = sorted(a)  # sort both lists
    b = sorted(b)
    result = []
    pos = 0  # pointer into list `b`
    for x in a:
        while pos < len(b) and b[pos] < x:
            pos += 1
        if pos < len(b) and x == b[pos]:
            pos += 1  # `x` exists in `b` at index `pos`
            continue
        result.append(x)
    return result

Solution 2: Counters

The collections package contains a Counter type that counts the occurrences of each added element:

from collections import Counter
c = Counter(['a', 'b', 'c', 'a', 'a', 'b'])
print(c)  # Counter({'a': 3, 'b': 2, 'c': 1})

So what you can do is to maintain a counter for each "set", and do the difference operation efficiently using the stored counts:

def difference_counter(a, b):
    a = Counter(a)  # construct counters for both lists
    b = Counter(b)  # you can also maintain counters directly while iterating over raw data

    # A counter is essentially a dictionary, so the `dict` methods are supported.
    for x, cnt in b.items():  # iterate over elements and counts in `b`
        if x in a and a[x] <= cnt:  # if count of `x` in `a` <= count of `x` in `b`
            del a[x]  # remove `x` from `a`
        else:
            a[x] -= cnt  # subtract the count in `b` from the count in `a`
    result = []  # convert the counter back into a list
    for x, cnt in a.items():
        result += [x] * cnt
    return result

You can verify that the implementations give correct results:

a = [1, 1, 1, 1, 2, 2, 2, 3, 3]
b = [1, 2, 2, 3, 4, 5]
result = [1, 1, 1, 2, 3]
assert sorted(difference_list(a, b)) == sorted(result)
assert sorted(difference_sorted(a, b)) == sorted(result)
assert sorted(difference_counter(a, b)) == sorted(result)
Zecong Hu
  • 2,584
  • 18
  • 33
0

There is a module for that https://pypi.org/project/multiset/

pip install multiset
>>> set1 = Multiset('aab')
>>> set2 = Multiset('abc')
>>> sorted(set1 | set2)
['a', 'a', 'b', 'c']
Tronic
  • 1,248
  • 12
  • 16