
I'm having some trouble understanding this behaviour. I'm measuring the execution time with the timeit module and get the following results for 10000 cycles:

  • Merge: 1.22722930395
  • Bubble: 0.810706578175
  • Select: 0.469924766812

This is my code for MergeSort:

def mergeSort(array):
    if len(array) <= 1:
        return array
    else:
        left = array[:len(array)/2]
        right = array[len(array)/2:]
        return merge(mergeSort(left),mergeSort(right))

def merge(array1,array2):
    merged_array=[]
    while len(array1) > 0 or len(array2) > 0:

        if array2 and not array1:
            merged_array.append(array2.pop(0))

        elif (array1 and not array2) or array1[0] < array2[0]:
            merged_array.append(array1.pop(0))

        else:
            merged_array.append(array2.pop(0))
    return merged_array
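
The timing itself is set up along these lines (a simplified sketch, not my exact harness; the list size and contents here are placeholders):

import random
import timeit

# placeholder input; the real test lists may differ in size and range
data = [random.randint(0, 1000) for _ in range(100)]

# sort a fresh copy each cycle; timeit returns the total seconds for `number` runs
total = timeit.timeit(lambda: mergeSort(list(data)), number=10000)
print("total: %f" % total)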

Edit:

I've changed the list operations to use index pointers, and my tests now run on a list of 1000 random numbers from 0-1000. (By the way: I changed to only 10 cycles here.)

result:

  • Merge: 0.0574434420723
  • Bubble: 1.74780097558
  • Select: 0.362952293025

This is my rewritten merge definition:

def merge(array1, array2):
    merged_array = []
    pointer1, pointer2 = 0, 0
    while pointer1 < len(array1) and pointer2 < len(array2):
        if array1[pointer1] < array2[pointer2]:
            merged_array.append(array1[pointer1])
            pointer1 += 1
        else:
            merged_array.append(array2[pointer2])
            pointer2 += 1
    while pointer1 < len(array1):
        merged_array.append(array1[pointer1])
        pointer1 += 1

    while pointer2 < len(array2):
        merged_array.append(array2[pointer2])
        pointer2 += 1

    return merged_array

seems to work pretty well now :)

algorithms
  • I'm sure it's because of the recursion you're doing. I think you're adding huge overhead. – Jakob Bowyer Aug 15 '11 at 10:01
  • Try profiling your code. – Jakob Bowyer Aug 15 '11 at 10:06
  • Recursion and list pop/append operations; Python may hide the gritty details from you, but they still take a ton of CPU time :) – Torp Aug 15 '11 at 10:10
  • Show us your other sorts too. Basically though, @Jakob is right (So is @Torp), function calls, attribute access, and non-local variable lookups are slow. You could try putting `merge` first and doing `def mergeSort(array, merge=merge):` – agf Aug 15 '11 at 10:10
  • 3
    Why would anyone _ever_ implement another sorting algorithm anyway, when you've already got [Timsort](https://secure.wikimedia.org/wikipedia/en/wiki/Timsort)? It's a mergesort anyway. – agf Aug 15 '11 at 10:25
  • 3
    @agf Hell why use timsort. We have [sleepsort](http://beust.com/weblog/2011/06/15/sleep-sort/) – Jakob Bowyer Aug 15 '11 at 10:33
  • But its performance characteristics are only good when you only have a few distinct values! – agf Aug 15 '11 at 10:37
  • ;) then if we must. shuffle sort. – Jakob Bowyer Aug 15 '11 at 10:41
  • Can you give your bubblesort implementation? I can't get anywhere near those timing comparisons! (see [my answer](http://stackoverflow.com/questions/7063697/why-is-my-mergesort-so-slow-in-python/7064487#7064487) with benchmark code below) – Francois G Aug 15 '11 at 11:59

4 Answers


list.pop(0) pops the first element and has to shift all the remaining ones; this is an additional O(n) operation which must not happen.

Also, slicing a list object creates a copy:

left = array[:len(array)/2]
right = array[len(array)/2:]

This means you're also using O(n * log(n)) memory instead of O(n).

I can't see your BubbleSort, but I bet it works in-place; no wonder it's faster.

You need to rewrite it to work in-place: instead of copying parts of the original list, pass starting and ending indexes.
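
For instance (a rough sketch of the idea, not tested against the benchmarks here): recurse on index bounds and merge through a single shared scratch buffer, so the only extra memory is one O(n) list:

def merge_sort_inplace(a, lo=0, hi=None, buf=None):
    # sorts a[lo:hi] in place; buf is one shared scratch list
    if hi is None:
        hi, buf = len(a), [None] * len(a)
    if hi - lo <= 1:
        return a
    mid = (lo + hi) // 2
    merge_sort_inplace(a, lo, mid, buf)
    merge_sort_inplace(a, mid, hi, buf)
    i, j, k = lo, mid, lo
    while i < mid and j < hi:        # merge the two sorted halves into buf
        if a[i] <= a[j]:
            buf[k] = a[i]; i += 1
        else:
            buf[k] = a[j]; j += 1
        k += 1
    buf[k:k + (mid - i)] = a[i:mid]  # copy whichever half has leftovers
    buf[k + (mid - i):hi] = a[j:hi]
    a[lo:hi] = buf[lo:hi]            # write the merged run back
    return a

No slices are created on the way down, only index arithmetic.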

hamstergene
  • Because of the `pop(0)` it would be faster to use a `deque`, and I bet it would be faster to get `right` by doing `del array[len(array)/2:]`? – agf Aug 15 '11 at 10:40

For starters: I cannot reproduce your timing results, on 100 cycles and lists of size 10000. The exhaustive benchmark with timeit of all the implementations discussed in this answer (including bubblesort and your original snippet) is posted as a gist here. I find the following results for the average duration of a single run:

  • Python's native (Tim)sort: 0.0144600081444
  • Bubblesort: 26.9620819092
  • (Your) Original Mergesort: 0.224888720512

Now, to make your function faster, you can do a few things.

  • Edit: Well, apparently, I was wrong on that one (thanks cwillu). Length computation takes O(1) in Python. But removing useless computation everywhere still improves things a bit (Original Mergesort: 0.224888720512, no-length Mergesort: 0.195795390606):

    def nolenmerge(array1,array2):
        merged_array=[]
        while array1 or array2:
            if not array1:
                merged_array.append(array2.pop(0))
            elif (not array2) or array1[0] < array2[0]:
                merged_array.append(array1.pop(0))
            else:
                merged_array.append(array2.pop(0))
        return merged_array
    
    def nolenmergeSort(array):
        n  = len(array)
        if n <= 1:
            return array
        left = array[:n/2]
        right = array[n/2:]
        return nolenmerge(nolenmergeSort(left),nolenmergeSort(right))
    
  • Second, as suggested in this answer, pop(0) is linear. Rewrite your merge to pop() at the end:

    def fastmerge(array1,array2):
        merged_array=[]
        while array1 or array2:
            if not array1:
                merged_array.append(array2.pop())
            elif (not array2) or array1[-1] > array2[-1]:
                merged_array.append(array1.pop())
            else:
                merged_array.append(array2.pop())
        merged_array.reverse()
        return merged_array
    

    This is again faster: no-len Mergesort: 0.195795390606, no-len Mergesort+fastmerge: 0.126505711079

  • Third - and this would only be useful as-is if you were using a language that does tail call optimization; without it, it's a bad idea - your merge sort is not tail-recursive: it calls both mergeSort(left) and mergeSort(right) recursively while there is remaining work in the call (the merge).

    But you can make the merge sort tail-recursive by using CPS (this will run out of stack for even modest list sizes if you don't do TCO):

    def cps_merge_sort(array):
        return cpsmergeSort(array,lambda x:x)
    
    def cpsmergeSort(array,continuation):
        n  = len(array)
        if n <= 1:
            return continuation(array)
        left = array[:n/2]
        right = array[n/2:]
        return cpsmergeSort (left, lambda leftR:
                             cpsmergeSort(right, lambda rightR:
                                          continuation(fastmerge(leftR,rightR))))
    

    Once this is done, you can do TCO by hand to defer the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here, trick originally due to Guy Steele). Trampolining and CPS work great together.

    You write a thunking function that "records" and delays application: it takes a function and its arguments, and returns a function that returns (that original function applied to those arguments).

    thunk = lambda name, *args: lambda: name(*args)
    

    You then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk)

    def trampoline(bouncer):
        while callable(bouncer):
            bouncer = bouncer()
        return bouncer
    

    Then all that's left is to "freeze" (thunk) all your recursive calls from the original CPS function, to let the trampoline unwrap them in proper sequence. Your function now returns a thunk, without recursion (and discarding its own frame), at every call:

    def tco_cpsmergeSort(array,continuation):
        n  = len(array)
        if n <= 1:
            return continuation(array)
        left = array[:n/2]
        right = array[n/2:]
        return thunk (tco_cpsmergeSort, left, lambda leftR:
                      thunk (tco_cpsmergeSort, right, lambda rightR:
                             (continuation(fastmerge(leftR,rightR)))))
    
    mycpomergesort = lambda l: trampoline(tco_cpsmergeSort(l,lambda x:x))
    

Sadly this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). OK, I guess the stack blowup of the recursive merge sort algorithm is in fact modest: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists, you get a function stack depth of at most log_2(10000) ≈ 14... pretty modest.

You can do a slightly more involved stack-based TCO elimination, along the lines of what this SO answer gives:

    def leftcomb(l):
        maxn,leftcomb = len(l),[]
        n = maxn/2
        while maxn > 1:
            leftcomb.append((l[n:maxn],False))
            maxn,n = n,n/2
        return l[:maxn],leftcomb

    def tcomergesort(l):
        l,stack = leftcomb(l)
        while stack: # l sorted, stack contains tagged slices
            i,ordered = stack.pop()
            if ordered:
                l = fastmerge(l,i)
            else:
                stack.append((l,True)) # store return call
                rsub,ssub = leftcomb(i)
                stack.extend(ssub) #recurse
                l = rsub
        return l

But this goes only a tad faster (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). Apparently, the stack building that Python does at the recursive calls of our original merge sort is pretty inexpensive.

The final results? On my machine (Ubuntu Natty's stock Python 2.7.1+), the average run timings (over 100 runs - except for Bubblesort - on lists of size 10000 containing random integers between 0 and 10000000) are:

  • Python's native (Tim)sort: 0.0144600081444
  • Bubblesort: 26.9620819092
  • Original Mergesort: 0.224888720512
  • no-len Mergesort: 0.195795390606
  • no-len Mergesort + fastmerge: 0.126505711079
  • trampolined CPS Mergesort + fastmerge: 0.170638551712
  • stack-based mergesort + fastmerge: 0.144994809628
Francois G
  • Wow, thanks. Impressive what a difference such tiny things as factoring out the len function make. I'll need to read a bit more about the tail-recursion thing, though, since I don't fully understand it yet. – algorithms Aug 15 '11 at 12:22
  • Your use of continuation-passing style is fascinating, but CPython doesn't actually eliminate tail calls, so that's not the reason why it runs quickly! The main reason it runs quickly appears to be that your `merge` and `fastmerge` functions have incorrect indentation on the last line, so that they do not actually sort the lists! – Gareth Rees Aug 15 '11 at 12:35
  • Even when the indentation is corrected, your `fastmerge` is still broken: the `reverse` method returns `None`. And when *that*'s corrected, there's another bug, which I leave to you as an exercise! – Gareth Rees Aug 15 '11 at 12:39
  • @Gareth Rees: Yes. *blush*. There was the indentation problem, the `reverse()` method returning nothing, and forgetting to call the continuation on the base case. Thanks for the corrections and comments! I'm in the process of updating everything (gist, answer, TCO considerations, etc.) – Francois G Aug 15 '11 at 13:17
  • edited to correct the rubbish (thanks again, Gareth!) - will show TCO if I find the time to test it – Francois G Aug 16 '11 at 18:06
  • @Gareth: added full tail call elimination examples (using trampolines & stacks). The timings are really not what I expected. Thanks a bunch for your examination of my code! If you have ideas explaining the results obtained, I'd love to hear about it! – Francois G Aug 17 '11 at 20:08
  • There's still a bug in `fastmerge`! Try `fastmerge([2],[1,3])`, for example. – Gareth Rees Aug 17 '11 at 23:14
  • Indeed! Fixed. Thanks again! Thankfully, my stupidity didn't affect the relative timings significantly this time. – Francois G Aug 17 '11 at 23:35

Your merge-sort has a big constant factor; you have to run it on large lists to see the asymptotic complexity benefit.
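
A quick way to see it (a hedged sketch: bubble_sort here is a hypothetical quadratic baseline, and mergeSort is assumed to be the one from the question):

import random
import timeit

def bubble_sort(a):
    # hypothetical O(n^2) baseline for comparison
    for i in range(len(a) - 1, 0, -1):
        for j in range(i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

for n in (100, 1000, 10000):
    data = [random.randint(0, n) for _ in range(n)]
    for name, fn in (("merge", mergeSort), ("bubble", bubble_sort)):
        t = timeit.timeit(lambda: fn(list(data)), number=5) / 5
        print("%s n=%d: %.4fs" % (name, n, t))

At small n the two stay close; as n grows, the quadratic sort falls far behind.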

Karoly Horvath

Umm... 1,000 records? You are still well within the range where the polynomial coefficients dominate. Say I have, with made-up coefficients:

  • selection-sort: 15 * n^2 (reads) + 5 * n^2 (swaps)
  • insertion-sort: 5 * n^2 (reads) + 15 * n^2 (swaps)
  • merge-sort: 200 * n * log(n) (reads) + 1000 * n * log(n) (merges)
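
Plug in n = 1,000 with those made-up coefficients: the quadratic sorts cost roughly 20 * 1000^2 = 2*10^7 operations, while merge-sort costs roughly 1200 * 1000 * log2(1000) ≈ 1.2*10^7 - the same order of magnitude, so the asymptotic winner hasn't pulled away yet.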

You're going to be in a close race for a long while. By the way, 2x faster in sorting is NOTHING; try 100x slower - that's where the real differences are felt. Try "won't finish in my lifetime" algorithms (there are known regular expressions that take this long to match simple strings).

So try 1M or 1G records and let us know if you still think merge-sort isn't doing too well.

That being said..

There are lots of things causing this merge-sort to be expensive. First of all, nobody ever runs quicksort or merge sort on small-scale data structures. Where you have "if (len <= 1)", people generally put "if (len <= 16): (use inline insertion-sort) else: merge-sort", at EACH level of the recursion (see the sketch after the next paragraph).

Insertion-sort has a smaller coefficient cost at smaller sizes of n. Note that 50% of your work is done in this last mile.
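
Here's the shape of that hybrid (a sketch: the cutoff of 16 is the conventional ballpark, not a tuned value, and it reuses the pointer-based merge from the question's edit):

def insertion_sort(a):
    # cheap on tiny lists: low constant factor, works in place
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def hybrid_merge_sort(array):
    if len(array) <= 16:   # small sublists: switch to insertion sort
        return insertion_sort(array)
    mid = len(array) // 2
    return merge(hybrid_merge_sort(array[:mid]),
                 hybrid_merge_sort(array[mid:]))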

Next, you are needlessly running array1.pop(0) instead of maintaining index counters. If you're lucky, Python is efficiently managing start-of-array offsets, but all else being equal, you're mutating the input parameters.

Also, you know the size of the target array during the merge, so why copy-and-double the merged_array repeatedly? Pre-allocate the target array at the start of the function; that'll save at least a dozen 'clones' per merge level, as in the sketch below.
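
Something like this (a sketch, not the asker's code): the destination is sized once up front, so append never has to grow it:

def merge_prealloc(array1, array2):
    merged = [None] * (len(array1) + len(array2))  # allocate once, full size
    i = j = k = 0
    while i < len(array1) and j < len(array2):
        if array1[i] <= array2[j]:
            merged[k] = array1[i]; i += 1
        else:
            merged[k] = array2[j]; j += 1
        k += 1
    if i < len(array1):        # bulk-copy whichever input has leftovers
        merged[k:] = array1[i:]
    else:
        merged[k:] = array2[j:]
    return merged

(In CPython, list.append is amortized O(1) thanks to over-allocation, so the win is smaller than a literal copy-and-double would suggest, but the reallocation churn does go away.)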

In general, merge-sort uses 2x the size of RAM. Your algorithm is probably using 20x because of all the temporary merge buffers (hopefully Python can free structures before recursing). It breaks elegance, but generally the best merge-sort algorithms make an immediate allocation of a merge buffer equal to the size of the source array, and perform complex address arithmetic (or array-index + span-length) to keep merging data structures back and forth. It won't be as elegant as a simple recursive solution like this one, but it comes somewhat close.
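
A bottom-up version shows the idea (a sketch of the technique, not a tuned implementation): one auxiliary buffer is allocated up front, and each pass merges back and forth between it and the source:

def bottomup_merge_sort(a):
    src, dst = list(a), [None] * len(a)  # exactly one extra buffer
    width = 1
    while width < len(src):
        for lo in range(0, len(src), 2 * width):
            mid = min(lo + width, len(src))
            hi = min(lo + 2 * width, len(src))
            i, j = lo, mid
            for k in range(lo, hi):  # merge src[lo:mid] and src[mid:hi] into dst
                if i < mid and (j >= hi or src[i] <= src[j]):
                    dst[k] = src[i]; i += 1
                else:
                    dst[k] = src[j]; j += 1
        src, dst = dst, src          # swap roles instead of copying back
        width *= 2
    return src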

In C sorting, cache coherence is your biggest enemy. You want hot data structures so you maximize your cache hits. By allocating transient temp buffers (even if the memory manager returns pointers to hot memory) you run the risk of making slow DRAM calls (pre-filling cache lines for data you're about to overwrite). This is one advantage insertion-sort, selection-sort and quick-sort have over merge-sort (when implemented as above).

Speaking of which, something like quick-sort is naturally elegant code, naturally efficient code, and doesn't waste any memory (google it on Wikipedia - they have a JavaScript implementation from which to base your code). Squeezing the last ounce of performance out of quick-sort is hard (especially in scripting languages, which is why they generally just use the C API to do that part), and you have a worst case of O(n^2). You can try to be clever with a combined bubble-sort/quick-sort to mitigate the worst case.
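
For reference, a bare-bones in-place quicksort (a sketch with a middle-element pivot; production versions randomize the pivot or use median-of-three to dodge the O(n^2) worst case):

def quick_sort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:              # Hoare-style partition around the pivot
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1; j -= 1
    quick_sort(a, lo, j)       # recurse into both partitions
    quick_sort(a, i, hi)
    return a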

Happy coding.

M. Maraist
  • Note that those "regular expressions" you're mentioning aren't actual ones by the formal definition. Only backtracking gets you into that mess. – Voo Aug 15 '11 at 12:47