
I have a list of strings that I want to sort by two custom key functions in Python 3.6. Comparing the multisort approach (sorting first by the secondary key and then by the primary key) with the multikey approach (sorting once with the tuple (primary_key, secondary_key) as the key), I see the latter being more than 2x slower than the former, which surprised me, as I thought the two approaches were equivalent. I would like to understand why this is so.

import random
from time import time

largest = 1000000
length = 10000000
start = time()
lst = [str(x) for x in random.choices(range(largest), k=length)]
t0 = time() - start

start = time()
tmp = sorted(lst, key=lambda x: x[::2])
l1 = sorted(tmp, key=lambda x: ''.join(sorted(x)))
t1 = time() - start

start = time()
l2 = sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
t2 = time() - start

print(f'prepare={t0} multisort={t1} multikey={t2} slowdown={t2/t1}')

assert l1 == l2
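
(Aside: time.perf_counter is a better clock for this kind of elapsed-time measurement than time.time; it is also what the benchmark code in one of the answers below uses. A minimal variant of one timing block, assuming the same lst as above:)

from time import perf_counter

start = perf_counter()
l2 = sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
t2 = perf_counter() - start
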
o17t H1H' S'k
  • In the first case you need to create 1 extra `list` whereas in the second case you need to create N (`10,000,000`) extra `tuple`s. That seems to be most of the cost. – Axe319 Oct 06 '21 at 15:45
  • You really don't need numpy or a numpy tag here. – Mad Physicist Oct 06 '21 at 17:07
  • #notmytag - fixed – o17t H1H' S'k Oct 06 '21 at 17:43
  • @wim the numpy initialization is much faster + you should not edit a code i am asking about + never ever edit someone else's code without permission. – o17t H1H' S'k Oct 06 '21 at 19:16
  • @eyaler Removing the numpy dependency when it's irrelevant to the question avoids the extra mental overhead of trying to figure out whether whatever numpy dtype the platform uses for integers is related (it's not, since they are all converted to strings anyway). I think that benefit outweighs a slower one-time setup. And on stack overflow we don't have to ask permission to make edits, that's not how the site works. – wim Oct 06 '21 at 19:47
  • @eyaler My way with `random.choices` seems to be a bit faster than with numpy, so maybe switch to that? I also prefer to not use numpy unless I really need to. For example at tio.run where I do most of my benchmarking demos, I'd need to switch to their older Python 3 version in order to use numpy. – Stefan Pochmann Oct 06 '21 at 20:27
  • @StefanPochmann I forgot about `random.choices`. Good idea. – wim Oct 06 '21 at 20:32
  • @StefanPochmann good idea! did it – o17t H1H' S'k Oct 06 '21 at 21:37
  • You *almost* did it :-). Your list comprehension runs Python code for every value and looks up `str` every time. My way with `map` is a bit faster. – Stefan Pochmann Oct 07 '21 at 14:14
  • @StefanPochmann I measured the difference and it was small. i never use map/filter/reduce in python – o17t H1H' S'k Oct 08 '21 at 01:45

2 Answers


Here's a third way to time:

start = time()
l3 = sorted(lst, key=lambda x: (''.join(sorted(x)) + "/" + x[::2]))
t3 = time() - start

and expanding the last line to

assert l1 == l2 == l3

This uses a single string as the key, combining the two string keys you view as the "primary" and "secondary" keys. Note that:

>>> chr(ord("0") - 1)
'/'

That's why the two keys can be combined - they're separated by an ASCII character that compares "less than" any ASCII digit (of course this is wholly specific to the precise kind of keys you're using).
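
As a quick sanity check (a tiny sketch with made-up sample values, added for illustration; it isn't part of the timed runs), the combined single-string key orders these digit strings exactly like the tuple key:

samples = ['314159', '911345', '123456', '183957', '42']
tuple_key = lambda x: (''.join(sorted(x)), x[::2])
combined_key = lambda x: ''.join(sorted(x)) + '/' + x[::2]
# '/' sorts before every digit, so a tie on the first key falls through to the second key
assert sorted(samples, key=tuple_key) == sorted(samples, key=combined_key)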

This is typically a little faster than multisort() for me, using the precise program you posted.

prepare=3.628943920135498 multisort=15.646344423294067 multikey=34.255955934524536 slowdown=2.1893903782103075 onekey=15.11461067199707

I believe the primary reason "why" is briefly explained at the end of a modern CPython distribution's Objects/listsort.txt:

As noted above, even the simplest Python comparison triggers a large pile of C-level pointer dereferences, conditionals, and function calls. This can be partially mitigated by pre-scanning the data to determine whether the data is homogeneous with respect to type. If so, it is sometimes possible to substitute faster type-specific comparisons for the slower, generic PyObject_RichCompareBool.

When there's a single string used as the key, this pre-sorting scan deduces that all the keys in the list are in fact strings, so all the runtime expense of figuring out which comparison function to call can be skipped: sorting can always call the string-specific comparison function instead of the all-purpose (and significantly more expensive) PyObject_RichCompareBool.

multisort() also benefits from that optimization.

But multikey() doesn't, much. The pre-sorting scan sees that all the keys are tuples, but the tuple comparison function itself can't assume anything about the types of the tuple's elements: it has to resort to PyObject_RichCompareBool every time it's invoked. (Note: as touched on in comments, it's not really quite that simple: some optimization is still done exploiting that the keys are all tuples, but it doesn't always pay off, and at best is less effective - and see the next section for clearer evidence of that.)
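
For illustration, here is a rough Python-level sketch (not CPython's actual C code) of what a tuple comparison t1 < t2 has to do; each element it looks at may go through the generic machinery once for == and possibly once more for <:

def tuple_lt(t1, t2):
    # scan for the first position where the elements differ
    for a, b in zip(t1, t2):
        if not (a == b):   # generic equality test (Py_EQ)
            return a < b   # a generic less-than test (Py_LT) then decides
    # all overlapping elements are equal: the shorter tuple is the smaller one
    return len(t1) < len(t2)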

Focus

There's a whole lot going on in the test case, which leads to needing ever-greater effort to explain ever-smaller distinctions.

So to look at the effects of the type homogeneity optimizations, let's simplify things a lot: no key function at all. Like so:

from random import random, seed
from time import time

length = 10000000
seed(1234567891)
xs = [random() for _ in range(length)]

ys = xs[:]
start = time()
ys.sort()
e1 = time() - start

ys = [(x,) for x in xs]
start = time()
ys.sort()
e2 = time() - start

ys = [[x] for x in xs]
start = time()
ys.sort()
e3 = time() - start
print(e1, e2, e3)

Here's typical output on my box:

3.1991195678710938 12.756590843200684 26.31903386116028

So it's by far fastest to sort floats directly. It's already very damaging to stick the floats in 1-tuples, but the optimization still gives highly significant benefit: it takes over twice as long again to stick the floats in singleton lists. In that last case (and only in that last case), PyObject_RichCompareBool is always called.

Tim Peters
  • Hmm, somewhat disappointing, [unsafe_tuple_compare's introduction](https://github.com/python/cpython/blob/54a4e1b53a18f0c7420ba03de9608194c4413fc2/Objects/listobject.c#L2176-L2183) makes it sound more optimized than it really is. Those darn nans... – Stefan Pochmann Oct 18 '21 at 08:44
  • Yes, looks like the first loop there was copied verbatim from `tuplerichcompare`, and doesn't save any calls to `PyObject_RichCompareBool`. If the initial tuple elements aren't equal, _then_ it gets out of the loop after the first such call, and saves (compared to plain old tuple comparison) another (`Py_LT`) call to `PyObject_RichCompareBool`. – Tim Peters Oct 18 '21 at 17:03
  • omg this is so awesome. imagine asking a question about quantum electrodynamics and getting an answer by richard feynman! anyway, @TimPeters thanks for your explanation, i did kinda hope you jump in. you demonstrated some extreme performance differences, and in my own original use case (exhausting ~100% ram and using paging) i could see slowdowns by factors in the hundreds. therefore, i would be happy to know your opinion: (1) is it worthwhile to suggest documenting this behavior/gotchas in the docs? (2) does it make sense to suggest having Python automatically convert multikey into multisort? – o17t H1H' S'k Oct 18 '21 at 22:10
  • @TimPeters I wrote a big update in my answer now, looking at the different numbers of comparisons. Thanks for the motivation :-) – Stefan Pochmann Oct 19 '21 at 04:33
  • @eyaler, I think you'd have to bring these up on the "python-ideas" mailing list. None of this is about Python the language, but quirks specific to the CPython implementation of Python's list sorting. About the only thing defined about Python's sorting is that it's stable. Perhaps some CPython-specific notes in Python's "Sorting HOW TO" docs? Regardless, I opened an issue report (https://bugs.python.org/issue45530) against CPython to improve the performance of `unsafe_tuple_compare()`. – Tim Peters Oct 20 '21 at 00:48
  • @TimPeters thanks for addressing this! i tried to look into the bug but it's a bit over my pay grade... is your patch expected to also make the difference between multisort and multikey negligible? – o17t H1H' S'k Oct 20 '21 at 20:36
  • @eyaler, afraid not. _You_ know whether you're using a `key=` argument to do a multi-key sort, but Python has no idea - it just sees that you're asking to derive a key by calling a function, which may do anything whatsoever. In fact, for the specific program you posted, it probably won't make a difference. As Stefan Pochmann said, the speed advantage the current sorting algorithm can get from many duplicates in a key is lost when that key is buried in a tuple of keys. The change would speed cases where tuple keys are used that _don't_ have many duplicates in the first position. – Tim Peters Oct 20 '21 at 20:43

Update

Tim and a_guest nudged me to analyze more. Here are counts of the different comparisons (per element) and other statistics:

               length:     10,000       100,000      1,000,000     10,000,000
                      | msort  mkey | msort  mkey | msort  mkey | msort  mkey |
----------------------+-------------+-------------+-------------+-------------+
 < ''.join(sorted(x)) | 11.64 11.01 | 13.86 11.88 | 13.10 12.00 | 12.16 12.06 |
 < x[::2]             | 11.99  0.96 | 13.51  3.20 | 13.77  5.35 | 13.79  6.68 |
== ''.join(sorted(x)) |       11.99 |       15.29 |       18.60 |       21.26 |
== x[::2]             |        0.98 |        3.42 |        6.60 |        9.20 |
----------------------+-------------+-------------+-------------+-------------+
time, μs per element  |  0.84  1.03 |  0.95  1.42 |  1.31  2.35 |  1.39  3.15 |
----------------------+-------------+-------------+-------------+-------------+
tracemalloc peak MiB  |  0.82  1.67 |  8.26 17.71 |  82.7 178.0 |   826  1780 |
----------------------+-------------+-------------+-------------+-------------+
sys.getsizeof MiB     |  0.58  1.80 |  5.72 17.82 |  57.6 179.5 |   581  1808 |
                      |  0.60       |  6.00       |  60.4       |   608       |
----------------------+-------------+-------------+-------------+-------------+
Numbers of comparisons

The numbers of comparisons are per element, i.e., I counted all comparisons and divided by the list size. So for example for a list with a million elements, multisort did 13.77 < comparisons per x[::2] string and then 13.10 < comparisons per ''.join(sorted(x)) string. Whereas multikey has many == comparisons of the strings and fewer < comparisons of the strings. As Tim pointed out, unsafe_tuple_compare uses the slower PyObject_RichCompareBool(..., Py_EQ) before it uses the faster unsafe_latin_compare (on the ''.join(sorted(x)) strings) or another slower PyObject_RichCompareBool(..., Py_LT) (on the x[::2] strings).

A key thing to note is that multisort's numbers of comparisons per element stay roughly constant and it only uses the faster unsafe_latin_compare. Whereas multikey's numbers of comparisons other than the < comparisons on ''.join(sorted(x)) grow rapidly, including the additional slower equality comparisons.

That has to do with your data, as x is the string representation of an integer from 0 to 1,000,000. That causes more and more duplicates, both for ''.join(sorted(x)) (which also has duplicates from the sorting, e.g., 314159 and 911345 both become '113459') and for x[::2] (which also has duplicates from the slicing, e.g., 123456 and 183957 both become '135').

For multisort this is good, as duplicates mean less work for it - equal things don't need to be "sorted further" (and I think streaks of equal things are good for Timsort's galloping).

But multikey suffers from the duplicates, because the == comparisons on ''.join(sorted(x)) result in "they're equal" more often, leading to more comparisons of the x[::2] strings. And these x[::2] often turn out unequal; note how the numbers for < x[::2] are relatively large compared to the == x[::2] comparisons (e.g., at ten million elements, there were 9.20 == comparisons and 6.68 < comparisons, so 73% of the time they were unequal). Which means the tuples also often turn out unequal, so they do need to be "sorted further" (and I think it's bad for galloping). And this further sorting will compare whole tuples again, also meaning even more == comparisons of the ''.join(sorted(x)) strings even though they're equal! That's why in multikey, the < comparisons of the ''.join(sorted(x)) strings remain fairly stable (from 11.01 per string for 10,000 elements, up to only 12.06 for 10 million elements) while their == comparisons grow so much (from 11.99 up to 21.26).
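
To get a feel for how many duplicates the data produces, here is a quick check using the question's setup (a separate sketch, not part of the measured runs); both derived keys collapse the one million values into far fewer distinct keys:

import random

lst = [str(x) for x in random.choices(range(1_000_000), k=1_000_000)]
print(len(set(''.join(sorted(x)) for x in lst)))  # only a few thousand distinct keys
print(len(set(x[::2] for x in lst)))              # only about a thousand distinct keys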

Times and Memory

The runtimes in the above table also reflect the little growth for multisort and the large growth for multikey. The one time where multisort got really quite a bit slower was at the step from 100,000 to 1,000,000 elements, going from 0.95 μs to 1.31 μs per element. As can be seen from the tracemalloc and sys.getsizeof rows of my table, its memory stepped up from ~6 MiB to ~59 MiB, and the CPU has about 34 MiB cache, so cache misses might be responsible for that.
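
For reference, a tracemalloc peak like the ones in the table can be obtained with a minimal sketch along these lines (shown here with a smaller list; this is not the exact code behind the table):

import random
import tracemalloc

lst = [str(x) for x in random.choices(range(1_000_000), k=100_000)]

tracemalloc.start()
sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))  # the multikey sort
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f'peak = {peak / 2**20:.1f} MiB')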

For the sys.getsizeof values I put all keys into a list and add the size of the list (since sorted also stores them) and the sizes of all strings/tuples in the list. For multisort, my table shows two values separately, as the two sorts happen one after the other, and the memory for the first sort is released before the second sort.

(Note: runtimes differ from my original answer as I had to use a different computer to sort ten million elements.)

Code for counting comparisons

I wrap the key strings in an S object that increments a counter for each comparison.

import random
from collections import Counter

class S:
    def __init__(self, value, label):
        self.value = value
        self.label = label
    def __eq__(self, other):
        comparisons['==', self.label] += 1
        return self.value == other.value
    def __ne__(self, other):
        comparisons['!=', self.label] += 1
        return self.value != other.value
    def __lt__(self, other):
        comparisons['<', self.label] += 1
        return self.value < other.value
    def __le__(self, other):
        comparisons['<=', self.label] += 1
        return self.value <= other.value
    def __ge__(self, other):
        comparisons['>=', self.label] += 1
        return self.value >= other.value
    def __gt__(self, other):
        comparisons['>', self.label] += 1
        return self.value > other.value

def key1(x):
    return S(''.join(sorted(x)), "''.join(sorted(x))")

def key2(x):
    return S(x[::2], 'x[::2]')

def multisort(l):
    tmp = sorted(l, key=lambda x: key2(x))
    return sorted(tmp, key=lambda x: key1(x))

def multikey(l):
    return sorted(l, key=lambda x: (key1(x), key2(x)))

funcs = [
    multisort,
    multikey,
]

def test(length, rounds):
    print(f'{length = :,}')

    largest = 1000000

    for _ in range(3):
        lst = list(map(str, random.choices(range(largest + 1), k=length)))
        for func in funcs:
            global comparisons
            comparisons = Counter()
            func(lst)
            print(func.__name__)
            for comparison, frequency in comparisons.most_common():
                print(f'{frequency / length:5.2f}', *comparison)
            print()

test(10_000, 1)
test(100_000, 10)
test(1_000_000, 1)
test(10_000_000, 1)

Original answer

Yeah, I've often used / pointed out that two simpler sorts can be faster than one more complex sort. Here are benchmarks with a few more versions:

At length = 10,000 elements, multikey took about 1.16 times as long as multisort:

multisort           12.09 ms  12.21 ms  12.32 ms  (no previous result)
multikey            14.13 ms  14.14 ms  14.14 ms  (same as previous result)
multisort_tupled    15.40 ms  15.61 ms  15.70 ms  (same as previous result)
multisort_inplaced  11.46 ms  11.49 ms  11.49 ms  (same as previous result)

At length = 100,000, it took about 1.43 times as long:

length = 100,000
multisort           0.162 s  0.164 s  0.164 s  (no previous result)
multikey            0.231 s  0.233 s  0.237 s  (same as previous result)
multisort_tupled    0.222 s  0.225 s  0.227 s  (same as previous result)
multisort_inplaced  0.156 s  0.157 s  0.158 s  (same as previous result)

At length = 1,000,000, it took about 2.15 times as long:

multisort           1.894 s  1.897 s  1.898 s  (no previous result)
multikey            4.026 s  4.060 s  4.163 s  (same as previous result)
multisort_tupled    2.734 s  2.765 s  2.771 s  (same as previous result)
multisort_inplaced  1.840 s  1.841 s  1.863 s  (same as previous result)

Reasons I see:

  • Building tuples costs extra time, and comparing tuples of things is slower than comparing just those things. See the times of multisort_tupled, where in one of the two sorts I wrap each real key in a single-value tuple. That made it much slower.
  • With large data, cpu/memory caches play a more significant role. Having extra tuples with two things per tuple is three times as many objects in memory as just having one of those things. Leads to more cache misses. Or even to (more) swap file usage if you go real big.
  • Tuples are registered for the reference cycle garbage collector, which can play a significant role. We don't produce reference cycles here, so we can disable it during the sorting; a minimal sketch of that experiment follows this list. Speeds it up a bit. Edit: I still think it should be at least slightly faster, but with more testing I got conflicting times regarding this, so took this out.
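
A minimal sketch of that gc experiment (using the question's data and the multikey sort; as the bullet above says, the timings for this were inconsistent):

import gc
import random
from time import perf_counter as timer

lst = [str(x) for x in random.choices(range(1_000_000), k=1_000_000)]

gc.disable()  # safe here: the sort creates no reference cycles
start = timer()
sorted(lst, key=lambda x: (''.join(sorted(x)), x[::2]))
print('multikey, gc disabled:', timer() - start)
gc.enable()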

Btw, notice my multisort_inplaced is a bit faster still. If you do multiple sorts, it's pointless to do them all with sorted. After the first one, just use inplace sort. No reason to create further new lists, which takes time/memory for the lists and takes time to update the reference counts for all the list elements.

Benchmark code (Try it online!):

import random
from time import perf_counter as timer

def multisort(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: ''.join(sorted(x)))

def multikey(l):
    return sorted(l, key=lambda x: (''.join(sorted(x)), x[::2]))

def multisort_tupled(l):
    tmp = sorted(l, key=lambda x: x[::2])
    return sorted(tmp, key=lambda x: (''.join(sorted(x)),))

def multisort_inplaced(l):
    tmp = sorted(l, key=lambda x: x[::2])
    tmp.sort(key=lambda x: ''.join(sorted(x)))
    return tmp

funcs = [
    multisort,
    multikey,
    multisort_tupled,
    multisort_inplaced,
]

def test(length, rounds):
    print(f'{length = :,}')

    largest = 1000000
    lst = list(map(str, random.choices(range(largest + 1), k=length)))

    prev = None
    for func in funcs:
        times = []
        for _ in range(5):
            time = 0
            for _ in range(rounds):
                copy = lst.copy()
                start = timer()
                result = func(copy)
                time += timer() - start
            times.append(time / rounds)
        print('%-19s' % func.__name__,
              *('%5.2f ms ' % (t * 1e3) for t in sorted(times)[:3]),
              # *('%5.3f s ' % t for t in sorted(times)[:3]),
              '(%s previous result)' % ('no' if prev is None else 'same as' if result == prev else 'differs from'))
        prev = result

test(10_000, 10)
# test(100_000, 10)
# test(1_000_000, 1)
Stefan Pochmann
  • yes i am actually using a second in place sort for my application but didn't want that to confuse with the difference. also i think that indeed cache may be important as i was hitting this with a much more extreme difference when using 100% ram and paging kicks in. – o17t H1H' S'k Oct 06 '21 at 19:20
  • Could one of you wizards explain why inplacing makes a difference here? I seem to remember that if there is a `key` then the whole sequence is mapped through it before sorting, no? – loopy walt Oct 06 '21 at 20:00
  • @loopywalt Like I said: `sorted` creates another list, which takes time/memory for the list and takes time to update the reference counts for all the list elements. – Stefan Pochmann Oct 06 '21 at 20:03
  • Thanks, @StefanPochmann. makes sense. I guess I was just surprised that on top of the mapping, sorting the mapped list, shuffling the original list accordingly, allocating a bit more space should make such a dent. I didn't think of the ref counts, but again, surprised they make so much of a difference. – loopy walt Oct 06 '21 at 20:13
  • @loopywalt That's not a dent ... [that's](https://stackoverflow.com/questions/42107442/why-is-copying-a-shuffled-list-much-slower) a dent [:-)](https://www.youtube.com/watch?v=iQrLPtr_ikE#t=24s) – Stefan Pochmann Oct 06 '21 at 21:55
  • Thanks, @StefanPochmann very educational---the first link, anyway ;-] – loopy walt Oct 06 '21 at 23:02
  • @StefanPochmann For better comparability, the `multisort_tupled` should also use a 2-tuple as key, e.g. `(''.join(sorted(x)), 0)`. Then you could also compare the number of swaps that need to be made for each of the sorts (via a custom `Key` class that wraps the key returned by the `key` functions and implements `__lt__` in order to count how many "successful" `<` comparisons there are). However I think in the end it all comes down to the details of how Timsort works under the hood and why that's favorable for the specific data. – a_guest Oct 07 '21 at 07:11
  • @a_guest Hmm, I'm not convinced that would be better. I'm trying to somewhat simulate `multikey` there, which for each comparison involves one tuple and one or two strings. My way, `multisort` already involves one tuple and two string. That it does two strings instead of just one or two, that seems appropriate to me, as that's a real difference between the two methods. But why should I additionally involve an int? Nice idea to count the "swaps", or rather *comparisons*, though. I might do that. – Stefan Pochmann Oct 07 '21 at 13:57
  • @a_guest I did the counting of comparisons now and thought it through further, see the big update. – Stefan Pochmann Oct 19 '21 at 04:30
  • @StefanPochmann Good analysis! Initially, I didn't think of the fact that for tuples `__lt__` is not sufficient in order to determine the outcome of `<`, they also need to invoke `__eq__` on their elements. Regarding this, do you know by chance what's the rationale behind calling `__eq__` first and `__lt__` second when comparing `(x,) < (y,)`? – a_guest Oct 19 '21 at 08:32
  • @a_guest That's how tuple comparison is done in general - find the first difference and then check what kind of difference it is. And while the tuple comparisons in sorting could be *implemented* differently, they of course should lead to the same *results*. Which they might not if we trust `<` over `==`. The original suggestion for these optimizations actually did use `x < y` first, and if that was false, `y < x`, and if that was false, conclude they're equal, otherwise continue comparing at index 1. But [Tim pointed out](https://github.com/python/cpython/pull/582#issuecomment-285838656) ... – Stefan Pochmann Oct 19 '21 at 11:16
  • ... that this deviates from normal results for `nan`s. That's what I meant in my comment under his answer :-). I haven't tested it, but I think `unsafe_tuple_compare` is also used if I sort two values of my own class, whose instances could always return `True` both for `<` and for `==`, in which case trusting the `<` without `==` again would deviate from the normal results. That said, I think the function *could* be optimized for known "well-behaving" types, and I might try and suggest that. – Stefan Pochmann Oct 19 '21 at 11:16
  • @a_guest [Tim opened an issue for that now](https://stackoverflow.com/questions/69468552/efficiency-of-sorting-by-multiple-keys-in-python/69610671?noredirect=1#comment123091311_69610671). – Stefan Pochmann Oct 20 '21 at 02:30