2

I am creating a classical "set" class to practice, and the first thing I want to do is remove all duplicates. I know I could do it easily with dictionary keys, but I wanted to try to improve my list comprehension. These two functions should do the same thing, but the second doesn't work. Why?

for element in elements:
            if elements.count(element) > 1:
                elements.remove(element)
        print(elements)

The second:

self.elements = [elements.remove(element) for element in elements
                 if elements.count(element) > 1]
flybonzai
  • 3,763
  • 11
  • 38
  • 72
  • 5
    Neither of the versions of your code will do what you want. Mutating a list as you are iterating over it will skip some values! – Blckknght Oct 16 '15 at 20:39
  • Possible duplicate of [Remove items from a list while iterating in Python](http://stackoverflow.com/questions/1207406/remove-items-from-a-list-while-iterating-in-python) – Makoto Oct 16 '15 at 20:40
  • what about list(set(elements)) ? – rebeling Oct 16 '15 at 20:50

2 Answers2

4

Don't iterate over and remove from the same list, you should also use a Counter dict to count the occurrences of each element if your objects are hashable:

from collections import Counter
cn = Counter(elements)
# elements[:] changes original list
elements[:] = (ele for ele in elements if ch[ele] < 2)

In your second code because list.remove is an inplace operation it will just add None's to your list anytime if elements.count(element) > 1 is True or else do nothing so the two code examples are completely different.

The first code if it does work only works by chance. When you remove an element from your list what a pointer was pointing to previously can change so you end up removing the wrong elements from your list.

An example of what your second code is doing and why your first is the wrong approach:

In [20]: l = [2,3,1,4,1,5]

In [21]: l = [l.remove(i)  if i > 1 else i for i in l]

In [22]: l
Out[22]: [None, 1, None, None]

Because you have changed the pointer values you end up removing the second 1 and with a few None's added because like all functions that operate inplace or don't specify a return value in python they return None by default.

If you actually want to get a unique set of all elements and not just keep the unique elements which is what your code seems to be attempting and also maintain the order, a collections.OrderedDict dict will do what you need:

from collections import OrderedDict
elements[:] =  collections.OrderedDict.fromkeys(elements)
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • Thanks, this makes sense! – flybonzai Oct 16 '15 at 20:58
  • No prob, the Counter dict approach also makes your code `O(n)` as opposed to `O(n^2)` as we only do a single pass to get the counts and then one more pass to filter the original list – Padraic Cunningham Oct 16 '15 at 21:01
  • The `Counter` code doesn't do what the questioner wants. It eliminates *all copies* of duplicated values, instead of all but one. – Blckknght Oct 16 '15 at 21:10
  • @Blckknght, it does what the OP is attempting to do with count so what they actually want is debatable, anyway that is secondary to what the OP asked as the question was about the reason the second code did not match the first and remove all duplicates could actually be just removing elements that appear more than once – Padraic Cunningham Oct 16 '15 at 21:15
  • @PadraicCunningham: I think the desire for a deduplicated list is pretty clear. The result should be something like `list(set(elements))` (perhaps with some specific order). You *can* use a `Counter` to improve the OP's original approach (replacing `elements.count[element]` with `cn[element]` and `elements.remove(element)` with `cn[element] -= 1`, and adding some new logic to yield the results), but your code is doing something different (*finding* the unique values, instead of *making* all the values unique). – Blckknght Oct 16 '15 at 21:30
  • @Blckknght, my code is doing what `if elements.count(element) > 1:elements.remove(element)` is doing, I went by the code provided as the OP said it happened to work. If they were basically creating a set I would simply use `collections.OrderedDict.fromkeys(elements)` – Padraic Cunningham Oct 16 '15 at 21:34
1

There are two issues with your code. The first issue is what you're explicitly asking about: The list comprehension version is going to assign a whole bunch of None values to self.elements. The None values are simply the return values from your calls to list.remove. It modifies the list in place, and doesn't have anything useful to return (so it returns None).

The comprehension [element for element in elements if elements.count(element) == 1 or elements.remove(element)] will work the same as your other code (since None is falsey and or short-circuits), but it still runs into the second issue. (It's also a bit of an ugly hack: The new list created by the comprehension will have the same contents as elements since remove modifies elements in place, and it's quite confusing. Writing hard to understand code is generally not a good idea.)

The second issue is that modifying a list while you are iterating over it can cause issues. List iterators work by index. The first item yielded by the iterator is from index 0, the second is from index 1, and so on. If you modify the list by removing an item early in the list, you'll shift the indexes of all the later items.

So, say you remove the first item (from index 0) just after your iterator has shown it to you. The list will shift all the later values up, but the iterator won't know about this. It will still yield the item at index 1 next, even though that used to be the item at index 2 (before the list was modified). The item originally at index 1 (and at index 0 after the previous item was removed) will be skipped by the iteration.

Here's a simple example of this issue, where values 2, 5, and 8 will be not be printed:

L = list(range(10)) # [0,1,2,3,4,5,6,7,8,9]
for x in L:
    print(x)
    if x % 3 == 1: # true for 1,4, and 7
        L.remove(x)

In the example, the logic for removing values is pretty simple and we never skip a value we would normally want to remove (so L has the expected value of [0,2,3,5,6,8,9] at the end), other code may not work as nicely.

A way to avoid this issue is to iterate over a copy of the list, while modifying the original. In this situation we'll also need to count in the original, rather than the copy:

for element in elements[:]: # copy list with a slice here!
    if elements.count(element) > 1:
        elements.remove(element)  # modify the original list

This is rather inefficient though, since removing an item from a list (at a position other than the end) needs to take the time to shift all the later values up one position. Counting is also slow, as you need to iterate the whole list for each item. It's much more efficient to keep track of the unique items you've seen so far, and skip duplicated items when you see them later:

seen = set()
results = []
for element in elements:
   if element not in seen:
       seen.add(element)
       results.append(element)

You can even build a somewhat awkward list comprehension (with side effects) of this code:

seen = set()
results = [element for element in elements
           if not (element in seen or seen.add(element))]

A better approach is usually to bundle the deduplicating logic into a generator function (like the unique_everseen recipe in the itertools documentation), and then call it with list(dedupe(elements)).

Blckknght
  • 100,903
  • 11
  • 120
  • 169