Python: how to remove duplicates from a list using a set (order is important)

Question

so I have this list: a = [-11, 13, 13, 10, -11, 10, 9, -3, 6, -9, -6, -6, 13, 8, -11, -5, 6, -8, -12, 5, -9, -1, -5, 2, -2, 13, 14, -9, 7, -4]

and by using a set I need to remove the duplicates and also keep them in the same order

I used this code:

def unique(a):
    a = set(a)
    return list(a)

it does remove the duplicates when I use it but the problem is it returns them in numerical order like this:

>>> unique(a)
[-2, 2, 5, 6, 7, 8, 9, 10, 13, 14, -12, -11, -9, -8, -6, -5, -4, -3, -1]

how can I return it in the same order as the original list while removing duplicates using sets?

EDIT:

so I've used this code because it worked:

def unique(a):
    seen = set()
    return [seen.add(x) or x for x in a if x not in seen]

but can someone explain to me what it does? because I need to make another once but it returns the list without negative numbers, and I can't do that unless I understand what that code does

The fact that this question is marked as Duplicate is ironic. — YM_coding, Feb 09 '21 at 11:56

abarnert · Answer 1 · 2013-10-09T18:03:09.723

This function already exists in the itertools recipes, as unique_everseen. You can copy and paste it from there, or read it to see how it works, or install the third-party package more-itertools and use it from there.

Here's a simplified version of the code:

def unique_everseen(iterable):
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

The version in the recipes allows for a key function, which you don't need, and it has two optimizations. But first understand the simple version:

seen is a set of all values seen so far. For each value, we check whether it's in seen. If so, we skip it. Otherwise, we add it to the set and yield it. So, we yield each element only the first time it's seen.

The first optimization in the recipe version is simple: looking up the seen.add method isn't quite free, so we do it once instead of N times, by doing seen_add = seen.add. This makes a sizable difference when benchmarking trivial cases, like a list of small integers; it may not make much difference in real use cases with values that are more expensive to hash.

The second optimization is to use ifilterfalse instead of an if to skip over the elements that have already been seen. Basically this means that if you have N elements and M unique elements, you only do M iterations in Python and N in the optimized C code inside ifilterfalse, instead of doing N in Python. Since iterating in C is much faster, this is worth it unless almost all of your elements are unique.

To make it work with a key function, all you have to do is keep a set of key(element) values seen so far, instead of element values seen so far. This makes the ifilterfalse optimization a little harder to do and much less effective, so it isn't done.

If you're only dealing with sequences, not arbitrary iterables, and you can count on Python 2.7+, there's another way to do this which is almost as efficient, and even simpler:

def unique(a):
    return OrderedDict.fromkeys(a).keys()

score 2 · Answer 2 · answered Oct 09 '13 at 17:42

2

Abuse of list comprehension:

def unique(seq):
    seen = set()
    return [seen.add(x) or x for x in seq if x not in seen]
    # or use parentheses instead of brackets above for a generator

answered Oct 09 '13 at 17:42

kindall

178,883
35
278
309

`seen.add` always returns `None`, so this will not work. – abarnert Oct 09 '13 at 17:43
After the edit, it does work, but it's still pretty horrible. Using `or` to sequence two operations in an expression is even more of an abuse than using a list comprehension for side effects. – abarnert Oct 09 '13 at 17:44
Actually, you can remove the listcomp abuse by just putting the `add` in the condition: `[x for x in seq if x not in seen and not seen.add(x)]`. But it's still an abuse of `seen.add`, and possibly even harder to see that way… – abarnert Oct 09 '13 at 18:04
I feel like `set` could really use a method that adds an item and returns a Boolean indicating whether the item actually needed to be added. Then you could just do e.g. `[x for x in seq if seen.did_add(x)]` – kindall Oct 09 '13 at 18:13
Almost none of the built-in types' methods mutate but return a value… but one of the few exceptions, `dict.setdefault`, isn't too far away from your suggested `set.did_add`, so maybe it's reasonable. – abarnert Oct 09 '13 at 18:36

Python: how to remove duplicates from a list using a set (order is important)

2 Answers2