
What is the fastest way to get a sorted, unique list in Python? (I have a list of hashable things, and want something I can iterate over - it doesn't matter whether the list is modified in place, or I get a new list, or just an iterable. In my concrete use case, I'm doing this with a throwaway list, so in place would be more memory efficient.)

I've seen solutions like

input = [5, 4, 2, 8, 4, 2, 1]
sorted(set(input))

but it seems to me that first checking for uniqueness and then sorting is wasteful (since when you sort the list, you basically have to determine insertion points, and thus get the uniqueness test as a side effect). Maybe there is something more along the lines of Unix's

cat list | sort | uniq

that just picks out consecutive duplications in an already sorted list?
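Something like the following sketch is the kind of thing I mean - using itertools.groupby to drop consecutive duplicates after sorting (the function name sort_uniq is just for illustration):

```python
import itertools

def sort_uniq(seq):
    # sort first, then keep one element from each run of equal values
    return [k for k, _ in itertools.groupby(sorted(seq))]

print(sort_uniq([5, 4, 2, 8, 4, 2, 1]))  # [1, 2, 4, 5, 8]
```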


Note that in the question 'Fastest way to uniqify a list in Python' the list is not sorted, and 'What is the cleanest way to do a sort plus uniq on a Python list?' asks for the cleanest / most Pythonic way; its accepted answer suggests sorted(set(input)), which I'm trying to improve on.

jdm
  • Does `sorted(set(input))` not meet your speed requirements in some way? – Aesthete Nov 28 '12 at 10:38
  • What does `timeit` tell you is fastest? – Martijn Pieters Nov 28 '12 at 10:40
  • Asking for the *fastest* is really not very meaningful, unless you tell us about your data (how many elements in the list, what fraction are duplicates, how much does it cost to compare two elements, how good and expensive is the hash function, etc.) – NPE Nov 28 '12 at 10:42
  • @Aesthete: Honestly, the code I'm working on right now is not performance critical in any way. My concern is mostly aesthetic. Think of it as an exercise or code golf. (However, I did have a case in another project some time ago where this was an issue.) – jdm Nov 28 '12 at 10:45
  • @MartijnPieters: I haven't tested it yet. I could come up with two or three alternatives, but maybe there is a brilliant uniq-sort function in the standard library that I don't know of. That's what I'm looking for, approaches that I wouldn't think of myself. – jdm Nov 28 '12 at 10:47
  • @jdm - `sorted(set(input))` is pretty damn aesthetically pleasing IMHO, but I appreciate your quest, and wish you well with it. I'm interested to see the results. – Aesthete Nov 28 '12 at 10:47
  • @Aesthete: sorted(set(input)) is surely the simplest and cleanest way to do it, and it is of course pretty :-), but I'm worried that I'm buying this clean code with wastefulness under the hood. It's a bit like calculating `n!` recursively vs. in a loop. The recursive version is cleaner and more logical, but wasteful in a language without tail recursion. – jdm Nov 28 '12 at 13:01
  • @jdm - Sure, but this is Python, and _readability counts_. If your concerns about the performance of some of Python's most basic functionality are an issue, there will always be other languages better suited to the task. I know a vector can be done in C++ in 2 lines, and I guarantee it's faster. – Aesthete Nov 28 '12 at 13:30
  • @NPE Constant-factor optimizations do matter, especially when the algorithm is used repeatedly on similarly sized data sets in an application. A `sortuniq` function in the programmer's toolbox would be a good thing, even if qualified as "only for lists of size N or smaller", etc. +1 for this question. – wberry Nov 28 '12 at 17:53
  • `sort | uniq` is inefficient - use `sort -u`. – user2357112 Dec 13 '17 at 01:10

5 Answers


I believe sorted(set(sequence)) is the fastest way of doing it. Yes, set iterates over the sequence, but that's a C-level loop, which is a lot faster than any loop you could write at the Python level.

Note that even with groupby you still have O(n) + O(nlogn) = O(nlogn), and what's worse, groupby requires a Python-level loop, which dramatically increases the constant in the O(n) term, so in the end you get worse results.

When speaking of CPython, the way to optimize things is to do as much as you can at the C level (see this answer for another example of counter-intuitive performance). To get a faster solution you would have to reimplement the sort in a C extension. And even then, good luck matching the speed of Python's Timsort!
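To see the C-level vs Python-level difference in isolation, here is a rough sketch (the name py_unique is made up for illustration) comparing set() with a hand-written Python loop that does the same O(n) work:

```python
import timeit

seq = list(range(500)) * 3  # plenty of duplicates

def py_unique(seq):
    # Same O(n) work as set(seq), but driven by a Python-level loop
    seen = set()
    out = []
    for x in seq:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

c_level = timeit.timeit(lambda: set(seq), number=1000)
py_level = timeit.timeit(lambda: py_unique(seq), number=1000)
print(c_level < py_level)  # True: the C-level loop wins by a wide margin
```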

A small comparison of the "canonical solution" versus the groupby solution:

>>> import timeit
>>> sequence = list(range(500)) + list(range(700)) + list(range(1000))
>>> timeit.timeit('sorted(set(sequence))', 'from __main__ import sequence', number=1000)
0.11532402038574219
>>> import itertools
>>> def my_sort(seq):
...     return list(k for k,_ in itertools.groupby(sorted(seq)))
... 
>>> timeit.timeit('my_sort(sequence)', 'from __main__ import sequence, my_sort', number=1000)
0.3162040710449219

As you can see, it's about 3 times slower.

The version provided by jdm is actually even worse:

>>> def make_unique(lst):
...     if len(lst) <= 1:
...         return lst
...     last = lst[-1]
...     for i in range(len(lst) - 2, -1, -1):
...         item = lst[i]
...         if item == last:
...             del lst[i]
...         else:
...             last = item
... 
>>> def my_sort2(seq):
...     make_unique(sorted(seq))
... 
>>> timeit.timeit('my_sort2(sequence)', 'from __main__ import sequence, my_sort2', number=1000)
0.46814608573913574

Almost 5 times slower. Note that seq.sort() followed by make_unique(seq), and make_unique(sorted(seq)), are actually the same thing: since Timsort uses O(n) space you always have some reallocation, so using sorted(seq) does not change the timings much.

jdm's benchmarks give different results because the input he is using is far too small, so most of the measured time is spent in the timing calls themselves.

Bakuriu

Maybe this is not the answer you are looking for, but it is worth taking into consideration anyway.

Basically, you have 2 operations on a list:

unique_list = set(your_list)       # O(n) complexity
sorted_list = sorted(unique_list)  # O(nlogn) complexity

Now, you say "it seems to me that first checking for uniqueness and then sorting is wasteful", and you are right. But, how bad really is that redundant step? Take n = 1000000:

# sorted(set(a_list))
O(n) => 1000000
O(nlogn) => 1000000 * 20 = 20000000
Total => 21000000

# Your fastest way
O(nlogn) => 20000000
Total: 20000000

Speed gain: (1 - 20000000/21000000) * 100 = 4.76 %

For n = 5000000, speed gain: ~4.3 % (the relative gain, 1/(1 + log2 n), shrinks as n grows)

Now, is that optimization worth it?
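The arithmetic above can be reproduced in a couple of lines (using log base 2, as the answer does; treating big-O terms as actual costs is of course only a rough model):

```python
import math

def estimated_gain(n):
    # Model: sorted(set(lst)) costs n + n*log2(n); an ideal combined
    # sort-uniq costs n*log2(n). The gain is the redundant O(n) share.
    return n / (n + n * math.log2(n))

print(round(estimated_gain(10**6) * 100, 2))  # 4.78
```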

kaspersky
  • I would not actually substitute numbers into an O(n) expression. The results are largely meaningless. – Waleed Khan Nov 28 '12 at 11:38
  • Ideally, it is not meaningless, but in most cases, you are right. I just intended to make some comparisons. I wanted to show that the bottleneck here is the sort, not the set creation, so optimizing the sort function would be much more productive than eliminating the redundant set creation. – kaspersky Nov 28 '12 at 11:46
  • You could demonstrate it without the numbers as well: O(n) + O(nlogn) = O(n+nlogn) = O(n(1+logn)) = O(nlogn) which means that the complexity remains the same. – GaretJax Nov 28 '12 at 11:57
  • Creating the `set` first means potentially the `n` used by the sort will be smaller. – Mark Ransom Apr 21 '23 at 03:43

This is just something I whipped up in a couple of minutes. The function modifies a list in place and removes consecutive repeats:

def make_unique(lst):
    if len(lst) <= 1:
        return lst
    last = lst[-1]
    for i in range(len(lst) - 2, -1, -1):
        item = lst[i]
        if item == last:
            del lst[i]
        else:
            last = item
    return lst

Some representative input data:

inp = [
(u"Tomato", "de"), (u"Cherry", "en"), (u"Watermelon", None), (u"Apple", None),
(u"Cucumber", "de"), (u"Lettuce", "de"), (u"Tomato", None), (u"Banana", None),
(u"Squash", "en"), (u"Rubarb", "de"), (u"Lemon", None),
]

Make sure both variants work as wanted:

print(inp)
print(sorted(set(inp)))
# copy because we want to modify it in place
inp1 = inp[:]
inp1.sort()
make_unique(inp1)
print(inp1)

Now to the testing. I'm not using timeit, since I don't want to time the copying of the list, only the sorting. time1 is sorted(set(...)), time2 is list.sort() followed by make_unique, and time3 is the solution with itertools.groupby by Avinash Y.

import time

def time1(number):
    total = 0
    for i in range(number):
        start = time.perf_counter()
        sorted(set(inp))
        total += time.perf_counter() - start
    return total

def time2(number):
    total = 0
    for i in range(number):
        inp1 = inp[:]
        start = time.perf_counter()
        inp1.sort()
        make_unique(inp1)
        total += time.perf_counter() - start
    return total

import itertools

def time3(number):
    total = 0
    for i in range(number):
        start = time.perf_counter()
        list(k for k, _ in itertools.groupby(sorted(inp)))
        total += time.perf_counter() - start
    return total

sort + make_unique is approximately as fast as sorted(set(...)). I'd have to do a couple more iterations to see which one is potentially faster, but within the variations they are very similar. The itertools version is a bit slower.

# done 3 times each
print(time1(100000))
# 2.38, 3.01, 2.59
print(time2(100000))
# 2.88, 2.37, 2.6
print(time3(100000))
# 4.18, 4.44, 4.67

Now with a larger list (the `+ str(i)` is to prevent duplicates):

old_inp = inp[:]
inp = []
for i in range(100):
    for j in old_inp:
        inp.append((j[0] + str(i), j[1]))

print(time1(10000))
# 40.37
print(time2(10000))
# 35.09
print(time3(10000))
# 40.0

Note that if there are a lot of duplicates in the list, the first version is much faster (since building the set first shrinks the input that has to be sorted).

inp = []
for i in range(100):
    for j in old_inp:
        #inp.append((j[0] + str(i), j[1]))
        inp.append((j[0], j[1]))

print(time1(10000))
# 3.52
print(time2(10000))
# 26.33
print(time3(10000))
# 20.5
jdm
  • Really bad example for the timings. With an input so small you have problems with granularity. – Bakuriu Nov 28 '12 at 13:01
import numpy as np
np.unique(...)

The np.unique function returns a sorted ndarray of the unique values of an array-like argument. It works with any NumPy dtype, but also with regular Python values, as long as they are orderable.

If you need a regular Python list, use np.unique(...).tolist().
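As a quick sanity check (assuming NumPy is installed), with a small sample list:

```python
import numpy as np

a = [2, 3, 4, 1, 2, 7, 8, 3]
print(np.unique(a))           # [1 2 3 4 7 8]
print(np.unique(a).tolist())  # [1, 2, 3, 4, 7, 8]
```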

user2473519
>>> import itertools
>>> a=[2,3,4,1,2,7,8,3]
>>> list(k for k,_ in itertools.groupby(sorted(a)))
[1, 2, 3, 4, 7, 8]
Avinash Y