
How can I remove duplicate items from a list if all I know is that the list elements can be ordered? (I also don't care about the order of items in the list.)

Existing questions like "How to remove duplicates from Python list and keep order?" or "Removing duplicates in lists" require the use of a set, i.e. require the items in the list to be hashable. In my case, hashability is not a guarantee.
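For example, a list of dicts can't go through a set at all:

>>> set([{'a': 2}, {'b': 3}])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'dict'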

vitiral
  • It is an absurd requirement for the question "how can I remove duplicates from a list" -- a list can contain ANYTHING, so it really doesn't answer the question, does it? – vitiral Sep 22 '15 at 05:07
  • @TigerhawkT3 please reopen. It took me a long time to find this and John La Rooy's answer is awesome! – vitiral Sep 22 '15 at 05:12
  • wow, that comment is pretty buried as the third answer down, and really the comment on the comment is the answer. But whatever. – vitiral Sep 22 '15 at 07:18
  • What about letting a Python set do all the work? `list(set(list))`. Does a set require hashable elements? – NoDataDumpNoContribution Dec 04 '15 at 17:42
  • Yes, a set requires hashable elements. `set([{'a': 2}])` throws `TypeError: unhashable type: 'dict'`. If I want to remove duplicates from a list of dicts I cannot use the "duplicates" those users marked. – vitiral Dec 04 '15 at 18:44

2 Answers


Calling `sorted` on an already sorted list has negligible overhead in Python. It's not really worth adding the extra complexity and the possibility that someone accidentally passes the wrong parameter to the function.

from itertools import groupby
def remove_duplicates(data):
    ''' Remove duplicates from the data (normally a list).
        The data must be sortable and have an equality operator
    '''
    data = sorted(data)
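    # sorting makes equal items adjacent; groupby then yields one key per run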
    return [k for k, v in groupby(data)]
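For example, with nested lists, which are sortable but not hashable, this works where a `set` would raise a `TypeError` (sample values made up):

>>> remove_duplicates([[3, 1], [1, 2], [3, 1], [1, 2]])
[[1, 2], [3, 1]]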
John La Rooy
  • Nice! I didn't realize itertools groupby could be used in this way; this is brilliant! Looking at the groupby code it looks like the "k" doesn't even need to be hashable, which is exactly what I was looking for. – vitiral Sep 22 '15 at 05:06
  • Since the question asks about "fastest", comparison results should probably be added to back the claim. – ivan_pozdeev Dec 04 '15 at 17:06
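A minimal timing sketch for the re-sorting claim (a hypothetical harness; absolute numbers depend on the machine):

import timeit

# Timsort detects existing runs, so re-sorting already-sorted data is cheap
print(timeit.timeit("sorted(data)", setup="data = sorted(range(3000))",
                    number=10000))
print(timeit.timeit("sorted(data)",
                    setup="import random; data = list(range(3000)); random.shuffle(data)",
                    number=10000))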

Edit: See John La Rooy's answer for a better one.

Again, this solution only works on a sortable list. If the data is pre-sorted (really, the items only need to be grouped) you can pass sort=False, and then only the equality operator is needed.

def remove_duplicates(data, sort=True):
    ''' Remove duplicates from the data (normally a list).
        The data must be sortable and have an equality operator
    '''
    if not data:
        return data
    if sort:
        data = sorted(data)
    out = [data[0]]
    for i, n in enumerate(data[1:]):
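        # data[i] is n's predecessor: enumerate counts from 0, the slice starts at 1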
        if data[i] != n:
            out.append(n)
    return out
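A quick check (sample values made up):

>>> remove_duplicates([3, 1, 3, 2, 1])
[1, 2, 3]
>>> remove_duplicates([3, 3, 1, 1, 2, 2], sort=False)  # already grouped
[3, 1, 2]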
vitiral
    You might as well define `__hash__` so you can use `set()`. – TigerhawkT3 Sep 22 '15 at 00:06
  • Rough testing, but your title says 'fastest'; I seem to get faster results using `from itertools import izip_longest` and `out = [x for (x,y) in izip_longest(data,data[1:]) if x != y]` (spelled out in the sketch after these comments). Giving range(1000) * 3, so everything is triplicated, and pre-sorted, and run 10,000 iterations, it takes about 5 seconds with your code and 3.3 seconds with izip_longest, and the result lists == the same. – TessellatingHeckler Sep 22 '15 at 00:32
  • That's roughly what [this user](http://stackoverflow.com/a/19135774/478656) suggests in one of the duplicate threads, but a) unrelated to numpy, b) I suspect they have a bug with missing the final list item if it's not a duplicate, and c) They're using the builtin zip which doesn't have the same kind of speedup when I try it. (Maybe in Python 3 it does?) The itertools version pads the shorter list with None and fixes the last item bug, I think, as well as being faster. – TessellatingHeckler Sep 22 '15 at 00:39
  • It would be faster to iterate over one extra item than to make this copy of a slice `data[1:]` – John La Rooy Sep 22 '15 at 00:48
  • @TessellatingHeckler - Python 3's `zip()`, `itertools.zip_longest()`, `range()`, etc. already produce lazy objects instead of eagerly creating a `list`. – TigerhawkT3 Sep 22 '15 at 01:09
  • @JohnLaRooy what do you mean "faster to iterate over one extra item"? If I just do counting through the list in Python with an index counter starting at 1, it gets dramatically slower... – TessellatingHeckler Sep 22 '15 at 01:24
  • @TigerhawkT3 Defining `__hash__` for mutables would be a rather bad idea. – vitiral Sep 22 '15 at 05:03
  • @JohnLaRooy you are right. Actually, I should use `islice` from `itertools` – vitiral Sep 22 '15 at 05:03
  • @TessellatingHeckler I like that. I bet John La Rooy beat both of us though :) Also, you forgot to sort. – vitiral Sep 22 '15 at 05:10
  • @JohnLaRooy's answer comes out (in my same test block) approximately the same as mine, but if he only called sorted() when requested, it's ~15% faster on pre-sorted ones. – TessellatingHeckler Sep 22 '15 at 05:12
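TessellatingHeckler's pairwise variant, spelled out as a runnable sketch (Python 3 names it zip_longest, izip_longest in Python 2; this assumes None does not appear in the data):

from itertools import zip_longest  # izip_longest in Python 2

def remove_duplicates_pairwise(data, sort=True):
    ''' Keep the last item of each run of equal adjacent items. '''
    if sort:
        data = sorted(data)
    # The final item is paired with the None fill value, so it always
    # survives the x != y test (hence the no-None assumption).
    return [x for x, y in zip_longest(data, data[1:]) if x != y]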