
I have been implementing my own heap module to help me understand the heap data structure. I understand how heaps work and are managed, but my implementation is significantly slower than the standard Python heapq module when performing a heap sort: for a list of size 100,000, heapq takes 0.6s while my code takes 2s (it was originally 2.6s; I cut it down to 2s by taking the len() calls out of percDown and passing the length through, so it doesn't have to recalculate len every time the method calls itself). Here is my implementation:

def percDown(lst, start, end, node):
    # Moves the given node down the heap, starting at index start, until the
    # heap property is satisfied (all children must be larger than their parent)
    iChild = 2 * start + 1
    i = start
    # if the node has reached the end of the heap (i.e. no children left),
    # return its index (we are done)
    if iChild > end - 1:
        return start
    # if the second child exists and is smaller than the first child, use that
    # child's index for the comparisons below
    if iChild + 1 < end and lst[iChild + 1] < lst[iChild]:
        iChild += 1
    # if the smallest child is less than the node, it becomes the new parent
    if lst[iChild] < node:
        # move the child up to the parent position
        lst[start] = lst[iChild]
        # continue recursively through the children of the new parent
        # to find where the node is meant to go
        i = percDown(lst, iChild, end, node)
    return i
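One micro-optimization in the same spirit as removing the len() calls is to make the sift iterative, since each recursive call pays Python's function-call overhead. A sketch of a hypothetical iterative variant (perc_down_iter is an illustrative name, not part of the module above; same arguments and return value as percDown):

```python
def perc_down_iter(lst, start, end, node):
    # Iterative sketch of percDown: walk down the tree, moving the
    # smaller child up at each level, until node's position is found.
    i = start
    child = 2 * i + 1
    while child < end:
        # pick the smaller of the two children, if the second exists
        if child + 1 < end and lst[child + 1] < lst[child]:
            child += 1
        # stop once neither child is smaller than the node
        if lst[child] >= node:
            break
        # move the child up to the parent position and descend
        lst[i] = lst[child]
        i = child
        child = 2 * i + 1
    return i

# quick check: sift a too-large root down a small heap
heap = [9, 1, 2, 3, 4, 5, 6]
pos = perc_down_iter(heap, 0, len(heap), 9)
heap[pos] = 9
print(heap)  # → [1, 3, 2, 9, 4, 5, 6]
```

This does the same sift as the recursive version, just in a loop, so it can be dropped into popMin and heapify unchanged.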

popMin: pops the minimum value (lst[0]) and reorders the heap

def popMin(lst):
    length = len(lst)
    if length > 1:
        minVal = lst[0]
        ele = lst.pop()
        # the heap now has length - 1 elements; sift the old last
        # element down from the root to restore the heap property
        i = percDown(lst, 0, length - 1, ele)
        lst[i] = ele
        return minVal
    else:
        return lst.pop()

heapify: turns a list into a heap in-place

import math

def heapify(lst):
    length = len(lst)
    # start from the last node that has at least one child
    iLastParent = math.floor((length - 1) / 2)
    while iLastParent >= 0:
        ele = lst[iLastParent]
        i = percDown(lst, iLastParent, length, ele)
        lst[i] = ele
        iLastParent -= 1

sort: heapsorts the given list using the methods above (not in-place)

def sort(lst):
    result = []
    heapify(lst)
    length = len(lst)
    for z in range(length):
        result.append(popMin(lst))
    return result
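For comparison, the same heapsort can be sanity-checked against the stdlib. A self-contained sketch using heapq (heap_sort is a hypothetical stand-in for the sort above, not part of the module):

```python
import heapq
import random

def heap_sort(lst):
    # stdlib-based equivalent of sort() above: heapify in-place,
    # then pop the minimum repeatedly into a new list
    heapq.heapify(lst)
    return [heapq.heappop(lst) for _ in range(len(lst))]

data = [random.randrange(1000) for _ in range(100)]
assert heap_sort(list(data)) == sorted(data)
```

Timing this against the pure-Python version on identical input is how the 0.6s vs. 2s numbers in the question were measured.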

Did I mistakenly add complexity to the algorithm/heap creation, or is it just the python heapq module being heavily optimized? I have a feeling it is the former, as 0.6s vs 2s is a huge difference.

asked by chillNZ (last edited by Ciro Santilli OurBigBook.com)
2 Answers


The Python heapq module uses a C extension. You are not going to beat C code with a pure-Python implementation.

From the heapq module source code:

# If available, use C implementation
try:
    from _heapq import *
except ImportError:
    pass

Also see the _heapqmodule.c C source.
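One way to measure the gap yourself is to block that accelerator so heapq falls back to its pure-Python code. A minimal sketch of the blocking trick (this only works if heapq has not been imported yet in the process):

```python
import sys

# A None entry in sys.modules makes "from _heapq import *" raise
# ImportError, so heapq keeps its pure-Python definitions.
sys.modules['_heapq'] = None
import heapq

lst = [9, 4, 7, 1, 3]
heapq.heapify(lst)  # now runs the pure-Python implementation
print(lst[0])  # → 1
```

Timing heap operations with and without this trick isolates how much of the difference is the C extension rather than the algorithm.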

answered by Martijn Pieters
  • Ahh, that makes sense, I forgot some Python modules use C to speed up processing versus using Python itself. That explains where most of the difference came from, the rest being my code is far from perfect. Thanks! – chillNZ Oct 13 '14 at 22:11

0.6s vs. 2.6s is a bit more than a 4x difference. Is that "too big"?

That's not enough information to answer. A 4x difference certainly could be caused by getting the algorithm wrong… but there's really no way to tell without testing at different sizes.

If you get, say, only a 1.2x difference at 1000, a 4x difference at 100000, and a 12x difference at 1000000, then your algorithmic complexity is most likely worse, which means you probably did get something wrong, and that's something you need to fix.

On the other hand, if it's about a 4x difference at all three sizes, then there's just a bigger constant multiplier in your overhead. Most likely because you've got an inner loop running in Python, while the (CPython) stdlib version is using the _heapq accelerator module that does that same loop in C, as explained in Martijn Pieters' answer. So, you didn't get anything wrong. You can probably micro-optimize a bit, but ultimately you're going to have to either rewrite the core of your code in C or run it in a JIT-optimized interpreter to get anywhere near as good as the stdlib. And really, if you're just writing this to understand the algorithm, you don't need to do that.
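That size-scaling test can be sketched as a small harness (hypothetical names; absolute times depend on the machine, so only the ratios matter):

```python
import heapq
import random
import time

def timed_heapsort(n):
    # time a full heapsort of n random floats with the stdlib heapq
    data = [random.random() for _ in range(n)]
    t0 = time.perf_counter()
    heapq.heapify(data)
    out = [heapq.heappop(data) for _ in range(n)]
    elapsed = time.perf_counter() - t0
    assert out == sorted(out)  # sanity check on the result
    return elapsed

# For an O(n log n) sort this ratio stays roughly flat (a bit above 10);
# a worse complexity would make it grow as n grows.
for n in (1_000, 10_000):
    print(n, timed_heapsort(10 * n) / timed_heapsort(n))
```

Running the same harness with your own sort in place of the heapq calls, and comparing the ratios, is the test described above.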

As a side note, you might want to try running the comparison in PyPy. Most of its stdlib is written in pure Python, with no special optimizations, but its optimizing JIT compiler makes it almost as fast as the native C code in CPython. And that same JIT will be applied to your code, meaning your unoptimized code is often nearly as fast as the native C code in CPython too. Of course there's no guarantee of that, and it doesn't change the fact that you always need to test at different sizes if you're trying to test for algorithmic complexity.

answered by abarnert
  • Good idea to test on the different sizes to compare complexity, I should've thought of that. I just tried it on the different sizes and my code is consistently around 4x slower than the heapq module. Thanks! I wasn't concerned that my code was not optimized to the extreme, more worried I'd missed the logic and complexity behind the algorithm, so I'm completely fine with my code being 4x slower - I'd never use my code over heapq anyway! – chillNZ Oct 13 '14 at 22:21
  • @chillNZ: Well, one day you'll need a maxheap instead of a minheap, or a 1-heap instead of a 0-heap, and then it'll be interesting to compare a pure-Python maxheap to the C minheap wrapped in a DSU that inverts all the comparisons to see which one's slower. :) (I had exactly that problem a few years ago, and they both sucked… but it was my first real program where PyPy kicked CPython's ass in speed, so that was my solution.) – abarnert Oct 13 '14 at 22:40
  • Hmm yeah, I actually am using my code over the python module to do an in-place maxheap heapsort now, since I need to add functionality that heapq doesn't have (moving the max element to the end of the list instead of just popping it, as well as creating a max heap in the first place). I don't think I'll be at the level to be worrying about in-place and other efficiency techniques for a while yet though, and once I'm there I'll hopefully be much more confident using my own code over stdlib code! (I wouldn't use the code I'm playing around with now which is what I meant in my first comment.) – chillNZ Oct 13 '14 at 23:10