6

I have read everywhere that for divide-and-conquer sorting algorithms like Merge-Sort and Quicksort, instead of recursing until only a single element is left, it is better to switch to Insertion-Sort when a certain threshold, say 30 elements, is reached. That is fine, but why only Insertion-Sort? Why not Bubble-Sort or Selection-Sort, both of which have similar O(N^2) performance? Insertion-Sort should come in handy only when many elements are pre-sorted (although that advantage should also come with Bubble-Sort), but otherwise, why should it be more efficient than the other two?

And secondly, at this link, in the 2nd answer and its accompanying comments, it says that O(N log N) performs poorly compared to O(N^2) up to a certain N. How come? N^2 should always perform worse than N log N, since N > log N for all N >= 2, right?

SexyBeast
  • 4
    Insertion sort is considered a fast algorithm for few elements; the reason is cache efficiency, if I remember correctly. – amit Sep 27 '12 at 13:03
  • There you are! Whenever I ask a question and you give an answer, somehow `cache` always creeps in! :) Yes, I have heard something like that, but nowhere is it explained how. Can you please explain why it is cache-efficient? – SexyBeast Sep 27 '12 at 13:05
  • 5
    Also, note that Big O notation gives information about the **asymptotic** behaviour of functions. It's not true that an O(n^2) algorithm **always** performs worse than an O(n log n) one. For example, if `f(n) = n^2` and `g(n) = 9999999 n log n`, then for small `n` an algorithm with complexity f(n) will be faster than one with complexity g(n). Asymptotic notation only guarantees that **there exists a number n such that for all m > n we have f(m) > g(m)**. – Haile Sep 27 '12 at 13:09
  • 1
    Take care of the hidden constants when using big-oh notation: compare f(n) = 10^-6 · n^2, which is in O(n^2), with g(n) = 10^10^10 · n · log(n), which is in O(n log n) – Kwariz Sep 27 '12 at 13:09
  • I am not sure it is correct, I really do not remember (It is a hunch, and thus a comment, and not a detailed answer). I could be wrong though, I will try to find the time to analyze it with focus on cache later today if no one will do it by then (no promises though) – amit Sep 27 '12 at 13:10
  • Ummm, right, so the constants play a big factor. That answers my 2nd question, thanks. But what about the first question? – SexyBeast Sep 27 '12 at 13:11
  • @amit, thanks. Promises may fail, but perseverance never fails. Will wait for your attempt. – SexyBeast Sep 27 '12 at 13:12
  • P.S. Cache is the immediate suspect with most small-scale problems. A cache miss can be dozens or more times slower than a cache hit, so on small-scale problems, even a difference of 2-3 cache misses is significant. – amit Sep 27 '12 at 13:12
  • Yeah, have read something about that, locality of reference, etc. Although internet is rife with dozens of such papers analyzing these things, none produce an actual code pointing out which modifications enable the algorithm to ensure cache efficiency and how. – SexyBeast Sep 27 '12 at 13:14
  • Sorry for getting a bit off topic, but [this thread](http://stackoverflow.com/questions/9888154/which-of-these-two-for-loops-is-more-efficient-in-terms-of-time-and-cache-perfor) is a classic example of cache failure which can be easily modified. – amit Sep 27 '12 at 13:17
  • @Cupidvogel I think that question #1 is linked with question #2. In my copy of *Introduction to algorithms* by Cormen et al. it's written that insertion sort is often used for small n since its hidden constant factors are very small, thus insertion sort is faster than quicksort for small n (because the hidden constants in the O(n log n) time of quicksort are not SO small). – Haile Sep 27 '12 at 13:18
  • 2
    @Cupidvogel: After thinking about it carefully: I believe the issue is NOT cache. Modern machines have ~32KB cache, while 30 elements usually occupy much less than that. Thus - for almost any sort algorithm on 30 elements - all of them are expected to be read from memory into cache once and stay there for the entire duration of the sort (no element is expected to be evicted while sorting these 30 elements). – amit Sep 27 '12 at 13:50
  • Good question. I've never seen anyone give a convincing argument for choosing insertion sort as the fallback. A benchmark would be pretty easy to do I think, and would give you your answer (well, it won't answer the why, but it will tell you if the belief is true or not). Please post it if you do run one. – IVlad Sep 27 '12 at 13:57

6 Answers

12

If you bail out of each branch of your divide-and-conquer Quicksort when it hits the threshold, your data looks like this:

[the least 30-ish elements, not in order] [the next 30-ish ] ... [last 30-ish]

Insertion sort has the rather pleasing property that you can call it just once on that whole array, and it performs essentially the same as it does if you call it once for each block of 30. So instead of calling it in your loop, you have the option to call it last. This might not be faster, especially since it pulls the whole data through cache an extra time, but depending how the code is structured it might be convenient.

Neither bubble sort nor selection sort has this property, so I think the answer might quite simply be "convenience". If someone suspects selection sort might be better then the burden of proof lies on them to "prove" that it's faster.

Note that this use of insertion sort also has a drawback -- if you do it this way and there's a bug in your partition code then provided it doesn't lose any elements, just partition them incorrectly, you'll never notice.
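A minimal sketch of this "call it last" structure (my own names, using a Lomuto partition; the answer doesn't prescribe a specific partition scheme): the recursion stops at the threshold, leaving the coarse blocks in the right relative order, and a single insertion-sort pass over the whole array finishes the job.

```java
import java.util.Arrays;
import java.util.Random;

public class CutoffQuicksort {
    static final int THRESHOLD = 30;

    // Partition around the last element (Lomuto scheme).
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    // Recurse only while the range exceeds THRESHOLD; small ranges are
    // left unsorted, but each block already holds the right elements.
    static void quicksortCoarse(int[] a, int lo, int hi) {
        if (hi - lo + 1 <= THRESHOLD) return;
        int p = partition(a, lo, hi);
        quicksortCoarse(a, lo, p - 1);
        quicksortCoarse(a, p + 1, hi);
    }

    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int v = a[i], j = i;
            while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
            a[j] = v;
        }
    }

    // Insertion sort is called once, last, instead of inside the recursion.
    static void sort(int[] a) {
        quicksortCoarse(a, 0, a.length - 1);
        insertionSort(a);
    }

    public static void main(String[] args) {
        Random r = new Random(1);
        int[] a = new int[1000];
        for (int i = 0; i < a.length; i++) a[i] = r.nextInt(10000);
        int[] expected = Arrays.copyOf(a, a.length);
        Arrays.sort(expected);
        sort(a);
        System.out.println(Arrays.equals(a, expected));
    }
}
```

Since no out-of-order pair ever spans a block boundary, the final insertion-sort pass never moves an element more than a block-width, which is why it performs essentially the same as sorting each block separately.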

Edit: apparently this modification is due to Sedgewick, who wrote his PhD thesis on Quicksort in 1975. It was analyzed more recently by Musser (the inventor of Introsort). Reference: https://en.wikipedia.org/wiki/Introsort

Musser also considered the effect on caches of Sedgewick's delayed small sorting, where small ranges are sorted at the end in a single pass of insertion sort. He reported that it could double the number of cache misses, but that its performance with double-ended queues was significantly better and should be retained for template libraries, in part because the gain in other cases from doing the sorts immediately was not great.

In any case, I don't think the general advice is "whatever you do, don't use selection sort". The advice is, "insertion sort beats Quicksort for inputs up to a surprisingly non-tiny size", and this is pretty easy to prove to yourself when you're implementing a Quicksort. If you come up with another sort that demonstrably beats insertion sort on the same small arrays, none of those academic sources is telling you not to use it. I suppose the surprise is that the advice is consistently towards insertion sort, rather than each source choosing its own favorite (introductory teachers have a frankly astonishing fondness for bubble sort -- I wouldn't mind if I never heard of it again). Insertion sort is generally thought of as "the right answer" for small data. The issue isn't whether it "should be" fast, it's whether it actually is or not, and I've never particularly noticed any benchmarks dispelling this idea.

One place to look for such data would be in the development and adoption of Timsort. I'm pretty sure Tim Peters chose insertion for a reason: he wasn't offering general advice, he was optimizing a library for real use.

Steve Jessop
  • **Instead of calling it in your loop, you have the option to call it last.** I didn't understand that. Both resort to nested loops to do the sorting, no? – SexyBeast Sep 27 '12 at 14:30
  • +1, this never occurred to me, nor have I ever seen this implemented - but it is obviously true. Do you know of any std::sort implementations that do this? – ltjax Sep 27 '12 at 14:31
  • 2
    Option 1: `if (size_left_to_do < 30) { insertion_sort(data_to_do); continue; }`. Insertion sort is called "in your loop". Option 2: `if (size_left_to_do < 30) continue;`. Insertion sort is not called in the loop; instead you call `insertion_sort(the_original_array)` at the end. – Steve Jessop Sep 27 '12 at 14:33
  • Neat property. But for large datasets I suspect it's actually faster to just call insertion sort separately for each block of 30-ish elements, since those elements will already be in cache, which won't be the case if you do one big insertion sort at the end. – j_random_hacker Sep 27 '12 at 14:33
  • 1
    @j_random_hacker: indeed, I edited to say the same but I guess the messages crossed in the mail. Quicksort possibly pre-dates the routine use of memory caches entirely, and certainly pre-dates the current era where caching is *the* big performance issue (OK, maybe in the current era it's one of *the* two, the other being vectorization, but Quicksort pre-dates the era just before the current era too). The advice is probably of a similar vintage to Quicksort :-) – Steve Jessop Sep 27 '12 at 14:37
8
  1. Insertion sort is faster in practice than bubble sort, at least. Their asymptotic running time is the same, but insertion sort has better constants (fewer/cheaper operations per iteration). Most notably, it requires only a linear number of swaps of pairs of elements, and in each inner loop it performs comparisons between each of n/2 elements and a "fixed" element that can be stored in a register (while bubble sort has to read values from memory). I.e. insertion sort does less work in its inner loop than bubble sort.
  2. The answer claims that 10000 n lg n > 10 n² for "reasonable" n. This is true up to about 14000 elements.
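The claimed crossover is easy to check numerically. A quick sketch (my own code; assuming lg means log base 2, as is usual in this context):

```java
public class Crossover {
    // Smallest n >= 2 for which 10*n^2 exceeds 10000*n*lg(n), i.e. the
    // point past which the hypothetical "O(n^2)" cost function loses.
    // The inequality simplifies to n > 1000 * lg(n).
    static int crossover() {
        for (int n = 2; ; n++) {
            double nLogN = 10000.0 * n * (Math.log(n) / Math.log(2));
            double nSquared = 10.0 * n * n;
            if (nSquared > nLogN) return n;
        }
    }

    public static void main(String[] args) {
        // Lands just under 14000, consistent with the answer's claim.
        System.out.println("crossover at n = " + crossover());
    }
}
```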
Fred Foo
  • 2
    Citation or proof needed for 1. – IVlad Sep 27 '12 at 13:50
  • @IVlad: e.g., http://www.algorithmist.com/index.php/Insertion_sort -- the number of swaps performed, i.e. the number of reads/writes to memory, is linear in insertion sort but quadratic in bubble sort. Those are likely the most expensive operations in these algorithms as most of the rest can be done in registers. – Fred Foo Sep 27 '12 at 13:58
  • 3
    First, swaps and writes are not the same thing: insertion sort has linear swaps, but quadratic writes. Second, selection sort is the one with the least writes actually. So not a good explanation at all. – IVlad Sep 27 '12 at 14:01
  • @IVlad: hmm, yes, overlooked that. It would seem the comparisons are faster in insertion sort because one of the operands can be cached in a register. – Fred Foo Sep 27 '12 at 14:18
5

I am surprised no-one's mentioned the simple fact that insertion sort is simply much faster for "almost" sorted data. That's the reason it's used.
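A small sketch illustrating this (the instrumented variant and its name are mine): insertion sort's inner loop executes once per inversion, so on already-sorted input it does no element moves at all, and on nearly-sorted input only a handful.

```java
public class NearlySorted {
    // Insertion sort that counts element moves (shifts). The number of
    // shifts equals the number of inversions in the input, so sorted
    // input costs only the linear outer-loop scan.
    static long sortCountingShifts(int[] a) {
        long shifts = 0;
        for (int i = 1; i < a.length; i++) {
            int v = a[i], j = i;
            while (j > 0 && a[j - 1] > v) {
                a[j] = a[j - 1];
                j--;
                shifts++;
            }
            a[j] = v;
        }
        return shifts;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] sorted = new int[n];
        for (int i = 0; i < n; i++) sorted[i] = i;
        // No inversions: zero shifts.
        System.out.println("shifts on sorted input: " + sortCountingShifts(sorted));

        int[] reversed = new int[n];
        for (int i = 0; i < n; i++) reversed[i] = n - i;
        // Every pair is an inversion: n*(n-1)/2 shifts.
        System.out.println("shifts on reversed input: " + sortCountingShifts(reversed));
    }
}
```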

oshkosh
4

Here is an empirical proof that insertion sort is faster than bubble sort (for 30 elements, on my machine, with the attached implementation, using Java...).

I ran the attached code and found that bubble sort ran in an average of 6338.515 ns, while insertion sort took 3601.0 ns.

I used a Wilcoxon signed-rank test to check the probability that this is a fluke and that the two are actually the same - but the result is below the range of the numerical error (effectively P_VALUE ~= 0).

import java.util.Arrays;
import java.util.Random;

private static void swap(int[] arr, int i, int j) { 
    int temp = arr[i];
    arr[i] = arr[j];
    arr[j] = temp;
}

public static void insertionSort(int[] arr) { 
    for (int i = 1; i < arr.length; i++) {
        int j = i;
        while (j > 0 && arr[j-1] > arr[j]) { 
            swap(arr, j, j-1);
            j--;
        }
    }
}
public static void bubbleSort(int[] arr) { 
    for (int i = 0; i < arr.length; i++) { 
        boolean swapped = false;
        // elements beyond arr.length - i - 1 are already in their final positions
        for (int j = 0; j < arr.length - i - 1; j++) { 
            if (arr[j] > arr[j+1]) {
                swapped = true;
                swap(arr, j, j+1);
            }
        }
        if (!swapped) break; // no swaps in a full pass: the array is sorted
    }
}

public static void main(String... args) throws Exception {
    Random r = new Random(1);
    int SIZE = 30;
    int N = 1000;
    int[] arr = new int[SIZE];
    int[] millisBubble = new int[N];    // note: stores nanoseconds, despite the name
    int[] millisInsertion = new int[N]; // note: stores nanoseconds, despite the name
    System.out.println("start");
    // warm up the JIT (both methods, so neither pays compilation cost in the timed loop):
    for (int t = 0; t < 100; t++) { 
        insertionSort(arr);
        bubbleSort(arr);
    }
    for (int t = 0; t < N; t++) { 
        arr = generateRandom(r, SIZE);
        int[] tempArr = Arrays.copyOf(arr, arr.length);

        long start = System.nanoTime();
        insertionSort(tempArr);
        millisInsertion[t] = (int)(System.nanoTime()-start);

        tempArr = Arrays.copyOf(arr, arr.length);

        start = System.nanoTime();
        bubbleSort(tempArr);
        millisBubble[t] = (int)(System.nanoTime()-start);
    }
    int sum1 = 0;
    for (int x : millisBubble) {
        System.out.println(x);
        sum1 += x;
    }
    System.out.println("end of bubble. AVG = " + ((double)sum1)/millisBubble.length);
    int sum2 = 0;
    for (int x : millisInsertion) {
        System.out.println(x);
        sum2 += x;
    }
    System.out.println("end of insertion. AVG = " + ((double)sum2)/millisInsertion.length);
    System.out.println("bubble took " + ((double)sum1)/millisBubble.length + " while insertion took " + ((double)sum2)/millisInsertion.length);
}

private static int[] generateRandom(Random r, int size) {
    int[] arr = new int[size];
    for (int i = 0 ; i < size; i++) 
        arr[i] = r.nextInt(size);
    return arr;
}

EDIT:
(1) Optimizing the bubble sort (updated above) reduced the average time for bubble sort to 6043.806 ns - not enough to make a significant change. The Wilcoxon test is still conclusive: insertion sort is faster.

(2) I also added a selection sort test (code attached) and compared it against insertion sort. The results: selection sort took 4748.35 ns while insertion sort took 3540.114 ns.
The p-value for the Wilcoxon test is still below the range of numerical error (effectively ~= 0)

code for selection sort used:

public static void selectionSort(int[] arr) {
    for (int i = 0; i < arr.length ; i++) { 
        int min = arr[i];
        int minElm = i;
        for (int j = i+1; j < arr.length ; j++) { 
            if (arr[j] < min) { 
                min = arr[j];
                minElm = j;
            }
        }
        swap(arr,i,minElm);
    }
}
amit
  • 1
    Oh come on, optimize that bubble sort, even wikipedia has it optimized :). Besides, selection sort would be the interesting one I **think**. – IVlad Sep 27 '12 at 14:18
  • what is there to optimize in Bubble Sort? – SexyBeast Sep 27 '12 at 14:19
  • And optimize insertion sort too... you don't need to call swap in the inner loop. @Cupidvogel - see its wiki entry. – IVlad Sep 27 '12 at 14:20
  • Well, the first pass of the bubble sort puts the highest element in the last position, the 2nd pass places the 2nd-highest element in the 2nd last position. So after the nth pass, when the nth largest element has been placed, naturally the inner loop shouldn't cover 1 to n, but rather `n-i`. This is the standard practice, right? – SexyBeast Sep 27 '12 at 14:23
  • @Cupidvogel - you can also end the algorithm early if no swaps were performed in an inner loop iteration. This can help a lot for random test data. – IVlad Sep 27 '12 at 14:25
  • @IVlad: I am going to a meeting in a minute, but I promise I'll add selection sort later today. Regarding the optimizations: sorry, haven't programmed any of the naive sorting algorithms in ~4 years. I'll add them as well later on. – amit Sep 27 '12 at 14:26
  • P.S. empirical proofs are fun. – amit Sep 27 '12 at 14:28
  • @IVlad: I optimized bubble sort a bit, and rechecked. Also tested selection sort. Wilcoxon is conclusive - insertion is the best among these. – amit Sep 27 '12 at 15:02
  • Nice, I upvoted shortly after you posted anyway :). Was very cool seeing so many high-rep users here at once. – IVlad Sep 27 '12 at 15:22
4

The easier one first: why insertion sort over selection sort? Because insertion sort runs in O(n) for optimal input sequences, i.e. if the sequence is already sorted. Selection sort is always O(n^2).

Why insertion sort over bubble sort? Both need only a single pass for already-sorted input sequences, but insertion sort degrades better. To be more specific, insertion sort usually performs better than bubble sort does when there is a small number of inversions (source). This can be explained by the fact that bubble sort always iterates over N-i elements in pass i, while insertion sort works more like a "find" and on average only needs to iterate over (N-i)/2 elements (in pass N-i-1) to locate the insertion position. So insertion sort is expected to be about two times faster than bubble sort on average.
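This factor-of-two gap in comparison counts is easy to check empirically. A rough sketch (the instrumented variants are mine, not from the answer): on random input, insertion sort's early-terminating inner loop does roughly n^2/4 comparisons, while bubble sort's full passes do roughly n^2/2.

```java
import java.util.Random;

public class CompareCounts {
    // Insertion sort, counting every comparison made in the inner loop.
    static long insertionComparisons(int[] a) {
        long c = 0;
        for (int i = 1; i < a.length; i++) {
            int v = a[i], j = i;
            while (j > 0) {
                c++;                          // one comparison per probe
                if (a[j - 1] <= v) break;     // found the insertion point
                a[j] = a[j - 1];
                j--;
            }
            a[j] = v;
        }
        return c;
    }

    // Bubble sort with early exit, counting comparisons.
    static long bubbleComparisons(int[] a) {
        long c = 0;
        for (int i = 0; i < a.length - 1; i++) {
            boolean swapped = false;
            for (int j = 0; j < a.length - i - 1; j++) {
                c++;
                if (a[j] > a[j + 1]) {
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                    swapped = true;
                }
            }
            if (!swapped) break;
        }
        return c;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        int n = 2000;
        int[] a = new int[n], b = new int[n];
        for (int i = 0; i < n; i++) { int v = r.nextInt(); a[i] = v; b[i] = v; }
        // Insertion sort needs roughly half as many comparisons on random data.
        System.out.println("insertion: " + insertionComparisons(a)
                         + ", bubble: " + bubbleComparisons(b));
    }
}
```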

ltjax
  • +1 for the observation about selection sort always requiring O(n^2) time. But the 2nd para is unsatisfying: the Wikipedia page just repeats what you're saying, without really explaining why. – j_random_hacker Sep 27 '12 at 14:36
2

EDIT: As IVlad points out in a comment, selection sort does only n swaps (and therefore only 3n writes) for any dataset, so insertion sort is very unlikely to beat it on account of doing fewer swaps -- but it will likely do substantially fewer comparisons. The reasoning below better fits a comparison with bubble sort, which will do a similar number of comparisons but many more swaps (and thus many more writes) on average.


One reason why insertion sort tends to be faster than the other O(n^2) algorithms like bubble sort and selection sort is because in the latter algorithms, every single data movement requires a swap, which can be up to 3 times as many memory copies as are necessary if the other end of the swap needs to be swapped again later.

With insertion sort OTOH, if the next element to be inserted isn't already the largest element so far, it can be saved into a temporary location, and all the larger elements before it shunted forward, starting from the right, using single data copies (i.e. without swaps). This opens up a gap to put the original element in.

C code for insertion-sorting integers without using swaps:

void insertion_sort(int *v, int n) {
    int i = 1;
    while (i < n) {
        int temp = v[i];         // Save the current element here
        int j = i;

        // Shunt everything forwards
        while (j > 0 && v[j - 1] > temp) {
            v[j] = v[j - 1];     // Look ma, no swaps!  :)
            --j;
        }

        v[j] = temp;
        ++i;
    }
}
j_random_hacker
  • In insertion sort, each element is replaced by the element ahead of it until a value smaller than the value to be positioned is reached. So that does require swaps, right? – SexyBeast Sep 27 '12 at 14:04
  • @Cupidvogel: That's one way to implement it, but not the fastest way. I'll post a C snippet in a moment. – j_random_hacker Sep 27 '12 at 14:04
  • 1
    There is nothing inherently slow about a swap, a swap is just three writes. Insertion sort has more writes on average than selection sort, which has `O(n)` swaps, so I really don't think this is convincing enough. Insertion sort wins in number of reads / comparisons, so a convincing argument would have to be made in that direction I think. – IVlad Sep 27 '12 at 14:06
  • Ummm, is there more than one variant of Insertion-Sort? – SexyBeast Sep 27 '12 at 14:07
  • `int temp = v[i]; ... v[j] = temp;` while technically not a swap because you don't exchange the two original values, the complexity is the same. Sorry, but this is a bad argument: you might not technically have any swaps, but you have more writes than selection sort has. The only writes in selection sort come from a single swap in each iteration of the outer loop. You have that + writes in the inner loop. – IVlad Sep 27 '12 at 14:17
  • @IVlad: Good point about number of reads. But doesn't bubblesort have about the same number of reads as insertion sort? So in the case of an IS-vs-BS comparison, IS would be faster because it has fewer swaps (and a similar number of reads), whereas in an IS-vs-SS comparison, IS would be faster because it has on average fewer reads/comparisons than SS, and this must win out even though it does on average many more writes. – j_random_hacker Sep 27 '12 at 14:27
  • Yes, I suspect something like that, maybe with IS being more cache-friendly too for some reason. Plus, on average, the inner loop in IS is not executed all the way through, while selection sort is always exactly `n(n - 1) / 2` iterations. Bubble sort can reduce the iterations in practice, which might make it comparable to IS, although my bet is on IS doing fewer writes and therefore winning. – IVlad Sep 27 '12 at 14:33