I want to implement the following procedure in the fastest way using Python 3: given a list of N random integers, I need to return the K smallest ones (and I do not need the returned integers to be sorted). I implemented it in three different ways, as you can see in the code below.
The test_sorted() function uses the built-in sorted() function to order the whole list of integers and then takes a slice of the first K elements. The cost of this operation should be essentially the cost of running sorted(), which has a time complexity of O(N log(N)).

The test_heap() function uses a heap to store only the lowest K elements and returns them. Since the heap never holds more than K elements, pushing an element onto it has a time complexity of O(log(K)), and in theory the number of times we need to push an item onto the heap is N. However, after the first K insertions we push and pop in a single step, and I would expect that when the incoming element is bigger than every element in the heap no real insertion occurs, so the time complexity should be somewhere between O(N) and O(N log(K)) (depending on the actual ordering of the input list). Anyway, even if my assumption is not true, the worst-case complexity should be O(N log(K)) (as usual, I consider the cost of the individual comparisons negligible).

The test_nsmallest() function uses the nsmallest() function from the heapq module. I had no expectations about this approach, and since the official Python documentation only says that "For larger values, it is more efficient to use the sorted() function", I decided to give it a try.
    # test.py
    from heapq import heappush, heappushpop, nsmallest
    from random import randint
    from timeit import timeit

    N, K = 1000, 50
    RANDOM_INTS = [randint(1, 100) for _ in range(N)]

    def test_sorted():
        return sorted(RANDOM_INTS)[:K]

    def test_heap():
        # Max-heap of the K smallest values, stored negated.
        heap = []
        for val in RANDOM_INTS:
            if len(heap) < K:
                heappush(heap, -val)
            else:
                heappushpop(heap, -val)
        return [-val for val in heap]

    def test_nsmallest():
        return nsmallest(K, RANDOM_INTS)

    def main():
        sorted_result = timeit("test_sorted()", globals=globals(), number=100_000)
        print(f"test_sorted took: {sorted_result}")
        heap_result = timeit("test_heap()", globals=globals(), number=100_000)
        print(f"test_heap took: {heap_result}")
        nsmallest_result = timeit("test_nsmallest()", globals=globals(), number=100_000)
        print(f"test_nsmallest took: {nsmallest_result}")

        r1, r2, r3 = test_sorted(), test_heap(), test_nsmallest()
        assert len(r1) == len(r2) == len(r3)
        assert set(r1) == set(r2) == set(r3)

    if __name__ == '__main__':
        main()
The output on my (old) late 2011 MacBook Pro with a 2.4GHz i7 processor is as follows.
$ python --version
Python 3.9.2
$ python test.py
test_sorted took: 8.389572635999999
test_heap took: 18.586762750000002
test_nsmallest took: 13.772040639000004
The simplest solution using sorted() is by far the best. Can anyone elaborate on why the outcome does not match my expectation (i.e., that the test_heap() function should be at least a bit faster)? What am I missing?
If I run the same code with pypy the result is the opposite.
$ pypy --version
Python 3.7.10 (51efa818fd9b, Apr 04 2021, 12:03:51)
[PyPy 7.3.4 with GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)]
$ pypy test.py
test_sorted took: 7.1336525249998886
test_heap took: 3.1177806880004937
test_nsmallest took: 7.5453417899998385
And this is something closer to my expectations.
Provided that I know nothing about the Python internals and only have a very rough understanding of why PyPy is faster than CPython, can anyone elaborate on these results and add some information about what is going on, so that I can correctly foresee the best choice in similar situations in the future?
Also, if you have any suggestion about other implementations that run faster than the above ones, please feel free to share!
UPDATE:

What if we need to sort the input list according to some criterion other than the value of the items themselves (as I actually need in my real use case; the above is just a simplification)? Well, in this case the outcome is even more surprising:
    # test2.py
    from heapq import heappush, heappushpop, nsmallest
    from random import randint
    from timeit import timeit

    N, K = 1000, 50
    RANDOM_INTS = [randint(1, 100) for _ in range(N)]

    def test_sorted():
        return sorted(RANDOM_INTS, key=lambda x: x)[:K]

    def test_heap():
        # Decorate each value with its (negated) sort key.
        heap = []
        for val in RANDOM_INTS:
            if len(heap) < K:
                heappush(heap, (-val, val))
            else:
                heappushpop(heap, (-val, val))
        return [val for _, val in heap]

    def test_nsmallest():
        return nsmallest(K, RANDOM_INTS, key=lambda x: x)

    def main():
        sorted_result = timeit("test_sorted()", globals=globals(), number=100_000)
        print(f"test_sorted took: {sorted_result}")
        heap_result = timeit("test_heap()", globals=globals(), number=100_000)
        print(f"test_heap took: {heap_result}")
        nsmallest_result = timeit("test_nsmallest()", globals=globals(), number=100_000)
        print(f"test_nsmallest took: {nsmallest_result}")

        r1, r2, r3 = test_sorted(), test_heap(), test_nsmallest()
        assert len(r1) == len(r2) == len(r3)
        assert set(r1) == set(r2) == set(r3)

    if __name__ == '__main__':
        main()
Which outputs:
$ python test2.py
test_sorted took: 18.740868524
test_heap took: 27.694126547999996
test_nsmallest took: 25.414596833000004
$ pypy test2.py
test_sorted took: 65.88409741500072
test_heap took: 3.9442632220016094
test_nsmallest took: 19.981832798999676
This tells me at least two things:

1. Using an external key for sorting is extremely expensive, both when you provide a lambda function via the key kwarg and when you need to build a (sorting_value, actual_value) tuple to obtain the desired ordering in the heap.

2. Using lambdas with PyPy seems to be extremely expensive, but I don't know why... maybe PyPy cannot optimize them and this doesn't play along with the other optimizations it performs?