9

I was wondering whether I should have my data structure as a set or a list. Mostly I will do set operations, but in the end I will need to sort it.

I wondered whether I should first make the set a list, and then use sorted(list(my_set)), or just sort the set immediately sorted(my_set). Arguably, I might consider a general "to list" phase, since having an ordered iterable at that point in time might make sense anyway.

So I decided to test it, expecting a list to be quicker.

Benchmarker:

import time
def sorter(x):
    t1 = time.time()
    for i in range(1000000):
        sorted(x)
    return time.time() - t1

Data:

one = range(1000)
a1 = list(one)
b1 = set(one)
sorter(a1)
# time: 16.5 s 
sorter(b1)
# time: 20.7 s

I then realized it might have to do with the fact that the elements are already in place, and remembered this amazing question & answer.

Then, I tried some random data:

two = numpy.random.randint(1, 1000, 1000)
a2 = list(two)
b2 = set(two)

With the results:

sorter(a2)
# time: 4min 49s
sorter(b2)
# time: 18.9 s

Huge difference, what is going on?

Bonus: It even appears at a timing of one minute, that sorted(set(a_list)) is impressively faster than sorted(a_list).

Indeed in the second case, there can be duplicates which would be filtered, and thus speed up sorting.

Community
  • 1
  • 1
PascalVKooten
  • 20,643
  • 17
  • 103
  • 160
  • @Rufflewind Bah, I should have checked the type. I always assumed `sorted` to return a list (since I only used it on a list naturally). Now I'm curious, if we were to loop over the set after sorting it, will that change the order? – PascalVKooten Dec 23 '14 at 00:10
  • @PascalVKooten Actually, it does return a list. – PascalVKooten Dec 23 '14 at 00:11
  • I retracted my comment because there may be legitimate reasons to have a *sorted version of a set*, but as you've discovered, a sorted set is no longer a set. – Rufflewind Dec 23 '14 at 00:12
  • 4
    A set will be mostly sorted by the hash key, which in the case of integers is the value itself. The Timsort algorithm in Python is good at recognizing already sorted sequences. – Mark Ransom Dec 23 '14 at 00:15
  • 5
    b2 will probably be significantly shorter than a2. This doesn't explain the whole effect, but it's important to notice that you're not working with comparable input sizes when you're timing these two operations – Jon Kiparsky Dec 23 '14 at 00:18
  • Arguably, going from set to sorted list can't be any slower than from set to list and then to sorted list, given that the latter involves an extra step. – Rufflewind Dec 23 '14 at 00:18
  • @Rufflewind As mentioned in the bonus, there is an extra step involved by making a set of the list before sorting it, which is faster than to simply sort the list. – PascalVKooten Dec 23 '14 at 00:20
  • @JonKiparsky So with classes with some attributes this wouldn't hold? – PascalVKooten Dec 23 '14 at 00:21
  • @PascalvKooten If your universe of classes can't contain duplicates, then your lists and your sets will be of the same length, and part of the effect you're asking about will be removed. – Jon Kiparsky Dec 23 '14 at 00:26
  • Interestingly, even after taking into account the issues mentioned by Paul McGuire and Jon Kiparsky, `list(sorted(set(x)))` is still faster than `list(sorted(x))` by a factor of about 2-3. Same behavior on both Python 2 and 3. – Rufflewind Dec 23 '14 at 00:31
  • @PascalvKooten I was referring to your original comment about how you were performing "mainly set operations", so presumably you are starting from a set to begin with. – Rufflewind Dec 23 '14 at 00:31
  • 1
    @PascalvKooten I'm not an expert in benchmarking, but I expect that if you want to time sorting of sets and lists a little more fairly, you could shuffle a range(1000), and then take the result as a set or as a list. This would at least have you starting from the same N. – Jon Kiparsky Dec 23 '14 at 00:33
  • @PaulMcGuire: `sorted()` always returns a list in both Python 2 and 3. – jfs Dec 23 '14 at 00:34
  • @JonKiparsky I agree, it is sloppy. Though when it happens 3 times in a row and it is such a huge difference, I'm pretty confident something else is going on. Also, the numpy does the shuffle here basically, but more perfectly would be to take a new random sample for each of the million iterations. – PascalVKooten Dec 23 '14 at 00:36
  • Use `list(range(1000))` and the `random.shuffle` it. (But you still get the same discrepancy as I said.) – Rufflewind Dec 23 '14 at 00:37
  • @PascalvKooten I agree. I'm just pointing out that unless you get rid of other factors, you haven't a prayer of figuring out what that something else is. (I think Mark Ransom has given you the best clue so far - you should try lists and sets of objects rather than integers) – Jon Kiparsky Dec 23 '14 at 00:38
  • @Rufflewind range(1000) is already a list in Python 2, and numpy here (second example) also provides a single random sample. – PascalVKooten Dec 23 '14 at 00:38
  • 1
    @PascalvKooten You want a random sample of *unique* elements. `numpy.random.randint` doesn't guarantee that. – Rufflewind Dec 23 '14 at 00:39
  • 1
    From my tests, using a nontrivial data type such as `(int, int)` reverses the trend seen here, though using an intermediate set only adds a minor inefficiency (~10%). I suspect the reason why using an intermediate set is faster for integers is because the set construction process automatically puts every integer in the correct order (or very nearly so) *without sorting them* because the trivial hash that Python uses. – Rufflewind Dec 23 '14 at 01:09
  • Keep in mind sets contain unique values, you may be comparing a very small set with a huge list. You shouldn't use random numbers in this case. – DiogoNeves May 20 '15 at 17:36

1 Answers1

3

I extended your code a bit and hope that this will give you insight into what is happening:

import numpy
import uuid
import random
import time

def sorter(x):
    t1 = time.time()
    for i in range(10000):
        sorted(x)
    return time.time() - t1

def pr(name, x):
    print('sorter {:<12s} {:<11} (length {:>4})'.format(
        name, '{:.8}'.format(sorter(x)), len(x)))

a2sizes = []
b2sizes = []

for x in range(1000):
    two = numpy.random.randint(1, 1000, 1000)
    a2 = list(two)
    b2 = set(two)
    a2sizes.append(len(a2))
    b2sizes.append(len(b2))

print 'average number of elements in a2', sum(a2sizes)/len(a2sizes)
n = sum(b2sizes)/len(b2sizes)
print 'average number of elements in b2', n

this prints out:

average number of elements in a2 1000
average number of elements in b2 632

This is because of the collision in the random number range

print
pr('a2', a2)
# making a list of set gives you already sorted elements
y = list(b2)
pr('y', y)
random.shuffle(y)
pr('shuffled y ', y)
pr('b2', b2)

gives as output:

sorter a2           2.492537    (length 1000)
sorter b2           0.25028086  (length  633)
sorter y            0.19689608  (length  633)
sorter shuffled y   1.4935901   (length  633)

That b2 would be faster because there are less elements is logical, but that this is even faster if you first make a list of the set must have some reason. That it again is slower if you shuffle that list is again logical and the shuffled result is rather close to the result for a2 when compensated for the length of the lists.

So lets try to put something else in the list:

b3 = set()
for x in range(1000):
    b3.add(uuid.uuid4())

print '\nuuid elements', len(b3)

a3 = list(b3)
pr('a3', a3)
random.shuffle(a3)
pr('shuffled a3', a3)
pr('b3', b3)

gives (I would have been rather surprised to have less than 1000 elements):

uuid elements 1000
sorter a3           32.437758   (length 1000)
sorter shuffled a3  32.178433   (length 1000)
sorter b3           32.163802   (length 1000)

So it must have something to do with having numbers in the set:

previous = -1
ordered = True
for popped in b2:
    if popped < previous:
        print 'popped', popped, previous
        ordered = False
    previous = popped

print '\nOrdered', ordered

gives you:

Ordered True

Instead of iterating , a set has a pop() function you can try and use:

pop()

Remove and return an arbitrary element from the set. Raises KeyError if the set is empty.

So lets arbitrarily retrieve elements from the set b2 and see if there is something special:

previous = -1
ordered = True
while(b2):
    popped = b2.pop()
    if popped < previous:
        print 'popped', popped, previous
        ordered = False
    previous = popped

print '\nOrdered', ordered

gives the same result:

Ordered True

So arbitrarily retrieving elements of set of number retrieves those numbers in order, independent off the ordering how these numbers were put in. As the iteration is how the list-making retrieves an element at a time to append to the list, the result of list(b2) is an ordered list and that gets sorted quite fast using the Timsort algorithm used in Python.

Anthon
  • 69,918
  • 32
  • 186
  • 246