12

I have a list that I'm attempting to remove duplicate items from. I'm using python 2.7.1 so I can simply use the set() function. However, this reorders my list. Which for my particular case is unacceptable.

Below is a function I wrote; which does this. However I'm wondering if there's a better/faster way. Also any comments on it would be appreciated.

    def ordered_set(list_):

        newlist = []
        lastitem = None
        for item in list_:

            if item != lastitem:
                newlist.append(item)
                lastitem = item

        return newlist

The above function assumes that none of the items will be None, and that the items are in order (ie, ['a', 'a', 'a', 'b', 'b', 'c', 'd'])

The above function returns ['a', 'a', 'a', 'b', 'b', 'c', 'd'] as ['a', 'b', 'c', 'd'].

rectangletangle
  • 50,393
  • 94
  • 205
  • 275
  • There is another similar question that gives a link to an implementation, http://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set#1653974 – milkypostman Jun 01 '11 at 07:24
  • Would it be preferable to have the list automatically stay sorted and be duplicate-free? Or is it fine to have to periodically purge the list of duplicates? – ninjagecko Jun 01 '11 at 07:26
  • You example code implies that `_list` is a sequence that has only contiguous duplicates. Is that what you mean? It won't work for inputs like these `[1, 2, -4, -4, 1]`: `1` will still be duplicated, while `-4` will be de-duplicated. – Pavel Repin Jun 01 '11 at 07:43

8 Answers8

12

Another very fast method with set:

def remove_duplicates(lst):
    dset = set()
    # relies on the fact that dset.add() always returns None.
    return [item for item in lst
            if item not in dset and not dset.add(item)] 
Zaur Nasibov
  • 22,280
  • 12
  • 56
  • 83
9

Use an OrderedDict:

from collections import OrderedDict

l = ['a', 'a', 'a', 'b', 'b', 'c', 'd']
d = OrderedDict()

for x in l:
    d[x] = True

# prints a b c d
for x in d:
    print x,
print
mhyfritz
  • 8,342
  • 2
  • 29
  • 29
  • 2
    this requires that the elements by hashable; for example, this would not work if the elements were lists or dictionaries; furthermore this requires an `O(n)` operation, rather than a bunch of `O(1)` operations (which may or may not be what the OP wants, just something to keep in mind) – ninjagecko Jun 01 '11 at 07:23
  • 21
    I've never seen a for loop described as "a bunch of O(1) operations" before. Hm, n O(1) operations would be... O(n) – mattdeboard Apr 03 '12 at 02:43
  • I think it's along the same lines as describing 4 as 2 + 2. – rectangletangle Feb 27 '14 at 02:05
  • But what if the list is LARGE? Saving those extra True would be costly memory-wise. I don't really understand why python has no ordered set. What's wrong with keeping insertion order by default? It's just a nice extra property to have! – Sean Nov 06 '17 at 08:36
7

Assuming the input sequence is unordered, here's O(N) solution (both in space and time). It produces a sequence with duplicates removed, while leaving unique items in the same relative order as they appeared in the input sequence.

>>> def remove_dups_stable(s):
...   seen = set()
...   for i in s:
...     if i not in seen:
...       yield i
...       seen.add(i)

>>> list(remove_dups_stable(['q', 'w', 'e', 'r', 'q', 'w', 'y', 'u', 'i', 't', 'e', 'p', 't', 'y', 'e']))
['q', 'w', 'e', 'r', 'y', 'u', 'i', 't', 'p']
Pavel Repin
  • 30,663
  • 1
  • 34
  • 41
  • [2,1,2,1] could become [1,2] or [2,1], I agree that [2,1] makes more sense in this case but it isn't implied in the question. If the set is ordered then your solution is still good, so +1 – Rusty Rob Jun 01 '11 at 08:31
  • @robert, might as well upvote @zaur's solution since it also does exactly same thing using a list comprehension. In retrospect, I like that one more since it looks like less code :) – Pavel Repin Jun 01 '11 at 08:36
  • Oops! @robert, I wasn't paying attention to the chronology. Duly up-noted :) – Pavel Repin Jun 01 '11 at 08:47
  • thanks. Yes @zaur's solution is good but will fail if the element can't be hashed. (we will all fail if the list isn't ordered). I think my solution is could be the fastest but haven't benched on large arrays that use all my memory =) – Rusty Rob Jun 01 '11 at 08:57
5

I know this has already been answered, but here's a one-liner (plus import):

from collections import OrderedDict
def dedupe(_list):
    return OrderedDict((item,None) for item in _list).keys()

>>> dedupe(['q', 'w', 'e', 'r', 'q', 'w', 'y', 'u', 'i', 't', 'e', 'p', 't', 'y', 'e'])
['q', 'w', 'e', 'r', 'y', 'u', 'i', 't', 'p']
sunetos
  • 3,448
  • 1
  • 23
  • 14
3

I think this is perfectly OK. You get O(n) performance which is the best you could hope for.

If the list were unordered, then you'd need a helper set to contain the items you've already visited, but in your case that's not necessary.

Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
2

There is unique_everseen solution described in http://docs.python.org/2/library/itertools.html

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
aloschilov
  • 566
  • 4
  • 5
2

if your list isn't sorted then your question doesn't make sense. e.g. [1,2,1] could become [1,2] or [2,1]

if your list is large you may want to write your result back into the same list using a SLICE to save on memory:

>>> x=['a', 'a', 'a', 'b', 'b', 'c', 'd']
>>> x[:]=[x[i] for i in range(len(x)) if i==0 or x[i]!=x[i-1]]
>>> x
['a', 'b', 'c', 'd']

for inline deleting see Remove items from a list while iterating or Remove items from a list while iterating without using extra memory in Python

one trick you can use is that if you know x is sorted, and you know x[i]=x[i+j] then you don't need to check anything between x[i] and x[i+j] (and if you don't need to delete these j values, you can just copy the values you want into a new list)

So while you can't beat n operations if everything in the set is unique i.e. len(set(x))=len(x) There is probably an algorithm that has n comparisons as its worst case but can have n/2 comparisons as its best case (or lower than n/2 as its best case if you know somehow know in advance that len(x)/len(set(x))>2 because of the data you've generated):

The optimal algorithm would probably use binary search to find maximum j for each minimum i in a divide and conquer type approach. Initial divisions would probably be of length len(x)/approximated(len(set(x))). Hopefully it could be carried out such that even if len(x)=len(set(x)) it still uses only n operations.

Community
  • 1
  • 1
Rusty Rob
  • 16,489
  • 8
  • 100
  • 116
0

Looks ok to me. If you really want to use sets do something like this:

def ordered_set (_list) :
    result = set()
    lastitem = None
    for item in _list :
        if item != lastitem :
            result.add(item)
            lastitem = item
    return sorted(tuple(result))

I don't know what performance you will get, you should test it; probably the same because of method's overheat!

If you really are paranoid, just like me, read here:

http://wiki.python.org/moin/HowTo/Sorting/

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

Just remembered this(it contains the answer):

http://www.peterbe.com/plog/uniqifiers-benchmark

StefanNch
  • 2,569
  • 24
  • 31
  • Did you test your code? You never assign anything to `result`. Also, the whole point of a set is that you don't need to check if it already contains an element or not - you'd just to `result = set(_list)`. No iteration required. But this method (or yours) would fail if the order of items is any other than alphabetical... – Tim Pietzcker Jun 01 '11 at 07:38
  • Anyone trying to use this function will get: NameError: global name 'newlist' is not defined – Rusty Rob Jun 01 '11 at 07:56