1

I have an array of numbers. I want to sort them and remove duplicates. This answer suggest to use set and sort for that kind of operation. The order of operations shouldn't change the results, so I measured the time of computations.

from numpy.random import randint
from time import clock

n = 1000000

A=randint(0,n,n)

t1=clock()
B=sorted(set(A))
t2=clock()
C=set(sorted(A))
t3=clock()

print t2-t1, t3-t2

>>> 0.48011 1.339263

sorted(set(A)) is roughly three times faster than set(sorted(A)).

What makes the one faster than the other? Also, is there any faster way to do it?

Community
  • 1
  • 1
ysakamoto
  • 2,512
  • 1
  • 16
  • 22

1 Answers1

8

This is because when you call:

set(sorted(A))

you are sorting the original full list and then filtering out the duplicate values. However, when you call:

sorted(set(A))

you are first shortening the list by removing duplicate values using set and then sorting the much smaller list hence the shorter time.

Hope that makes sense.

For Example

>>> A = [1,2,3,1,2,3,1,2,3,1,2,3]
>>> A = sorted(A)
[1,1,1,1,2,2,2,2,3,3,3,3]

>>> set(A)
{1, 2, 3}

On the other hand:

>>> A = [3,2,1,3,2,1,3,2,1,3,2,1]
>>> A = set(A)
{3, 2, 1}

>>> sorted(A)
{1, 2, 3}

And as @Barmar said, sorting is much slower than removing duplicates therefore there is a real time gain when you have to sort a much smaller list (1/4 of the list in my example above)

Time Benchmarking

a = [1,2,3] * 10000

set(sorted(a)) --> 0.1890001297
sorted(set(a)) --> 0.0079998970
Community
  • 1
  • 1
sshashank124
  • 31,495
  • 9
  • 67
  • 76
  • 1
    Also important is that sorting is slower than removing duplicates. – Barmar Apr 17 '14 at 07:46
  • So how slow is sort compared to set operation? – ysakamoto Apr 17 '14 at 08:02
  • @ysakamoto, That really depends on how ordered your list is and how many duplicate values are in there. Try running this demo I made a few times: http://repl.it/R3I – sshashank124 Apr 17 '14 at 08:06
  • But just saying "sorting is slower than removing duplicates" does not mean much. It is obvious from my benchmark. I want to know why sort is slower than set. – ysakamoto Apr 17 '14 at 08:09
  • @ysakamoto, because sorting involves a whole bunch of processes such as searching, deleting, inserting, shifting, etc. Whereas set just goes through a list and takes those variables that it wants. Something like `[i for i in input_list if i not in already_list]`. Something like that. – sshashank124 Apr 17 '14 at 08:12
  • @ysakamoto. If my answer was helpful, would you mind accepting it? Thank you. – sshashank124 Apr 17 '14 at 08:47
  • 1
    Because set uses an expected O(n) time algorithm using hash tables and sorting is Omega(n log n) and the constant factor is rather low in both cases – Niklas B. Apr 17 '14 at 15:18