Ordered Sets Python 2.7

Question

I have a list that I'm attempting to remove duplicate items from. I'm using python 2.7.1 so I can simply use the set() function. However, this reorders my list. Which for my particular case is unacceptable.

Below is a function I wrote; which does this. However I'm wondering if there's a better/faster way. Also any comments on it would be appreciated.

    def ordered_set(list_):

        newlist = []
        lastitem = None
        for item in list_:

            if item != lastitem:
                newlist.append(item)
                lastitem = item

        return newlist

The above function assumes that none of the items will be None, and that the items are in order (ie, ['a', 'a', 'a', 'b', 'b', 'c', 'd'])

The above function returns ['a', 'a', 'a', 'b', 'b', 'c', 'd'] as ['a', 'b', 'c', 'd'].

There is another similar question that gives a link to an implementation, http://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set#1653974 — milkypostman, Jun 01 '11 at 07:24
Would it be preferable to have the list automatically stay sorted and be duplicate-free? Or is it fine to have to periodically purge the list of duplicates? — ninjagecko, Jun 01 '11 at 07:26
You example code implies that `_list` is a sequence that has only contiguous duplicates. Is that what you mean? It won't work for inputs like these `[1, 2, -4, -4, 1]`: `1` will still be duplicated, while `-4` will be de-duplicated. — Pavel Repin, Jun 01 '11 at 07:43

Zaur Nasibov · Answer 1 · 2017-05-25T20:23:02.857

12

Another very fast method with set:

def remove_duplicates(lst):
    dset = set()
    # relies on the fact that dset.add() always returns None.
    return [item for item in lst
            if item not in dset and not dset.add(item)]

edited May 25 '17 at 20:23

answered Jun 01 '11 at 07:44

Zaur Nasibov

22,280
12
56
83

Thanks for supplementing, Pavel. – Zaur Nasibov Jun 01 '11 at 08:22

score 9 · Accepted Answer · answered Jun 01 '11 at 07:11

9

Use an OrderedDict:

from collections import OrderedDict

l = ['a', 'a', 'a', 'b', 'b', 'c', 'd']
d = OrderedDict()

for x in l:
    d[x] = True

# prints a b c d
for x in d:
    print x,
print

answered Jun 01 '11 at 07:11

mhyfritz

8,342
2
29
29

2

this requires that the elements by hashable; for example, this would not work if the elements were lists or dictionaries; furthermore this requires an `O(n)` operation, rather than a bunch of `O(1)` operations (which may or may not be what the OP wants, just something to keep in mind) – ninjagecko Jun 01 '11 at 07:23
21

I've never seen a for loop described as "a bunch of O(1) operations" before. Hm, n O(1) operations would be... O(n) – mattdeboard Apr 03 '12 at 02:43
I think it's along the same lines as describing 4 as 2 + 2. – rectangletangle Feb 27 '14 at 02:05
But what if the list is LARGE? Saving those extra True would be costly memory-wise. I don't really understand why python has no ordered set. What's wrong with keeping insertion order by default? It's just a nice extra property to have! – Sean Nov 06 '17 at 08:36

Pavel Repin · Answer 3 · 2011-06-01T08:11:49.743

7

Assuming the input sequence is unordered, here's O(N) solution (both in space and time). It produces a sequence with duplicates removed, while leaving unique items in the same relative order as they appeared in the input sequence.

>>> def remove_dups_stable(s):
...   seen = set()
...   for i in s:
...     if i not in seen:
...       yield i
...       seen.add(i)

>>> list(remove_dups_stable(['q', 'w', 'e', 'r', 'q', 'w', 'y', 'u', 'i', 't', 'e', 'p', 't', 'y', 'e']))
['q', 'w', 'e', 'r', 'y', 'u', 'i', 't', 'p']

edited Jun 01 '11 at 08:11

answered Jun 01 '11 at 07:56

Pavel Repin

30,663
1
34
41

[2,1,2,1] could become [1,2] or [2,1], I agree that [2,1] makes more sense in this case but it isn't implied in the question. If the set is ordered then your solution is still good, so +1 – Rusty Rob Jun 01 '11 at 08:31
@robert, might as well upvote @zaur's solution since it also does exactly same thing using a list comprehension. In retrospect, I like that one more since it looks like less code :) – Pavel Repin Jun 01 '11 at 08:36
Oops! @robert, I wasn't paying attention to the chronology. Duly up-noted :) – Pavel Repin Jun 01 '11 at 08:47
thanks. Yes @zaur's solution is good but will fail if the element can't be hashed. (we will all fail if the list isn't ordered). I think my solution is could be the fastest but haven't benched on large arrays that use all my memory =) – Rusty Rob Jun 01 '11 at 08:57

score 5 · Answer 4 · answered Aug 10 '11 at 22:41

I know this has already been answered, but here's a one-liner (plus import):

from collections import OrderedDict
def dedupe(_list):
    return OrderedDict((item,None) for item in _list).keys()

>>> dedupe(['q', 'w', 'e', 'r', 'q', 'w', 'y', 'u', 'i', 't', 'e', 'p', 't', 'y', 'e'])
['q', 'w', 'e', 'r', 'y', 'u', 'i', 't', 'p']

score 3 · Answer 5 · answered Jun 01 '11 at 07:04

3

I think this is perfectly OK. You get O(n) performance which is the best you could hope for.

If the list were unordered, then you'd need a helper set to contain the items you've already visited, but in your case that's not necessary.

answered Jun 01 '11 at 07:04

Tim Pietzcker

328,213
58
503
561

Why the downvote? I see nothing wrong with Tim Pietzckers post. – rectangletangle Jun 01 '11 at 07:43

score 2 · Answer 6 · answered Jan 16 '14 at 14:39

There is unique_everseen solution described in http://docs.python.org/2/library/itertools.html

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

score 2 · Answer 7 · edited May 23 '17 at 12:18

if your list isn't sorted then your question doesn't make sense. e.g. [1,2,1] could become [1,2] or [2,1]

if your list is large you may want to write your result back into the same list using a SLICE to save on memory:

>>> x=['a', 'a', 'a', 'b', 'b', 'c', 'd']
>>> x[:]=[x[i] for i in range(len(x)) if i==0 or x[i]!=x[i-1]]
>>> x
['a', 'b', 'c', 'd']

for inline deleting see Remove items from a list while iterating or Remove items from a list while iterating without using extra memory in Python

one trick you can use is that if you know x is sorted, and you know x[i]=x[i+j] then you don't need to check anything between x[i] and x[i+j] (and if you don't need to delete these j values, you can just copy the values you want into a new list)

So while you can't beat n operations if everything in the set is unique i.e. len(set(x))=len(x) There is probably an algorithm that has n comparisons as its worst case but can have n/2 comparisons as its best case (or lower than n/2 as its best case if you know somehow know in advance that len(x)/len(set(x))>2 because of the data you've generated):

The optimal algorithm would probably use binary search to find maximum j for each minimum i in a divide and conquer type approach. Initial divisions would probably be of length len(x)/approximated(len(set(x))). Hopefully it could be carried out such that even if len(x)=len(set(x)) it still uses only n operations.

StefanNch · Answer 8 · 2011-06-01T08:37:17.487

0

Looks ok to me. If you really want to use sets do something like this:

def ordered_set (_list) :
    result = set()
    lastitem = None
    for item in _list :
        if item != lastitem :
            result.add(item)
            lastitem = item
    return sorted(tuple(result))

I don't know what performance you will get, you should test it; probably the same because of method's overheat!

If you really are paranoid, just like me, read here:

http://wiki.python.org/moin/HowTo/Sorting/

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

Just remembered this(it contains the answer):

http://www.peterbe.com/plog/uniqifiers-benchmark

edited Jun 01 '11 at 08:37

answered Jun 01 '11 at 07:17

StefanNch

2,569
24
31

Did you test your code? You never assign anything to `result`. Also, the whole point of a set is that you don't need to check if it already contains an element or not - you'd just to `result = set(_list)`. No iteration required. But this method (or yours) would fail if the order of items is any other than alphabetical... – Tim Pietzcker Jun 01 '11 at 07:38
Anyone trying to use this function will get: NameError: global name 'newlist' is not defined – Rusty Rob Jun 01 '11 at 07:56

Ordered Sets Python 2.7

8 Answers8

Linked