77

I just stumbled upon something odd. At first I thought it might be another case of branch misprediction, but I cannot explain why branch misprediction should cause this behaviour.

I implemented two versions of Bubble Sort in Java and did some performance testing:

import java.util.Random;

public class BubbleSortAnnomaly {

    public static void main(String... args) {
        final int ARRAY_SIZE = Integer.parseInt(args[0]);
        final int LIMIT = Integer.parseInt(args[1]);
        final int RUNS = Integer.parseInt(args[2]);

        int[] a = new int[ARRAY_SIZE];
        int[] b = new int[ARRAY_SIZE];
        Random r = new Random();
        for (int run = 0; RUNS > run; ++run) {
            for (int i = 0; i < ARRAY_SIZE; i++) {
                a[i] = r.nextInt(LIMIT);
                b[i] = a[i];
            }

            System.out.print("Sorting with sortA: ");
            long start = System.nanoTime();
            int swaps = bubbleSortA(a);

            System.out.println(  (System.nanoTime() - start) / 1E9 + " seconds. "
                               + "It used " + swaps + " swaps.");

            System.out.print("Sorting with sortB: ");
            start = System.nanoTime();
            swaps = bubbleSortB(b);

            System.out.println(  (System.nanoTime() - start) / 1E9 + " seconds. "
                               + "It used " + swaps + " swaps.");
        }
    }

    public static int bubbleSortA(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] > a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    public static int bubbleSortB(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] >= a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    private static void swap(int[] a, int j, int i) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}

As you can see, the only difference between the two sorting methods is > versus >=. When running the program with java BubbleSortAnnomaly 50000 10 10, one would obviously expect sortB to be slower than sortA because it has to execute more swap(...)s. But I got the following (or similar) output on three different machines:

Sorting with sortA: 4.214 seconds. It used  564960211 swaps.
Sorting with sortB: 2.278 seconds. It used 1249750569 swaps.
Sorting with sortA: 4.199 seconds. It used  563355818 swaps.
Sorting with sortB: 2.254 seconds. It used 1249750348 swaps.
Sorting with sortA: 4.189 seconds. It used  560825110 swaps.
Sorting with sortB: 2.264 seconds. It used 1249749572 swaps.
Sorting with sortA: 4.17  seconds. It used  561924561 swaps.
Sorting with sortB: 2.256 seconds. It used 1249749766 swaps.
Sorting with sortA: 4.198 seconds. It used  562613693 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749880 swaps.
Sorting with sortA: 4.19  seconds. It used  561658723 swaps.
Sorting with sortB: 2.281 seconds. It used 1249751070 swaps.
Sorting with sortA: 4.193 seconds. It used  564986461 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749681 swaps.
Sorting with sortA: 4.203 seconds. It used  562526980 swaps.
Sorting with sortB: 2.27  seconds. It used 1249749609 swaps.
Sorting with sortA: 4.176 seconds. It used  561070571 swaps.
Sorting with sortB: 2.241 seconds. It used 1249749831 swaps.
Sorting with sortA: 4.191 seconds. It used  559883210 swaps.
Sorting with sortB: 2.257 seconds. It used 1249749371 swaps.

When I set the parameter for LIMIT to, e.g., 50000 (java BubbleSortAnnomaly 50000 50000 10), I get the expected results:

Sorting with sortA: 3.983 seconds. It used  625941897 swaps.
Sorting with sortB: 4.658 seconds. It used  789391382 swaps.

I ported the program to C++ to determine whether this problem is Java-specific. Here is the C++ code.

#include <cstdlib>
#include <iostream>

#include <omp.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE 50000
#endif

#ifndef LIMIT
#define LIMIT 10
#endif

#ifndef RUNS
#define RUNS 10
#endif

void swap(int * a, int i, int j)
{
    int h = a[i];
    a[i] = a[j];
    a[j] = h;
}

int bubbleSortA(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] > a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

int bubbleSortB(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] >= a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

int main()
{
    int * a = (int *) malloc(ARRAY_SIZE * sizeof(int));
    int * b = (int *) malloc(ARRAY_SIZE * sizeof(int));

    for (int run = 0; RUNS > run; ++run)
    {
        for (int idx = 0; ARRAY_SIZE > idx; ++idx)
        {
            a[idx] = std::rand() % LIMIT;
            b[idx] = a[idx];
        }

        std::cout << "Sorting with sortA: ";
        double start = omp_get_wtime();
        int swaps = bubbleSortA(a);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;

        std::cout << "Sorting with sortB: ";
        start = omp_get_wtime();
        swaps = bubbleSortB(b);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;
    }

    free(a);
    free(b);

    return (0);
}

This program shows the same behaviour. Can someone explain what exactly is going on here?

Executing sortB first and then sortA does not change the results.

Turing85
  • How did you measure the time? If you measure the time for only one case, the timing will depend strongly on the random sequences and `>` vs `>=` will have only a minor impact. To get really meaningful timing numbers you have to measure many different sequences and average. – 463035818_is_not_an_ai May 13 '15 at 19:20
  • @tobi303 look at the code. You can run it in a loop via the 3rd runtime parameter (Java) or `-DRUNS=XXX` (C++, compiler directive). And the results are reproducible. – Turing85 May 13 '15 at 19:21
  • It would be interesting to count the number of swaps in both cases to see how this relates to the runtime. In the case where A is slower, the cause is definitely not the number of swaps, so perhaps in the case where A is faster the reason is also not simply the number of swaps but some more subtle effect. – 463035818_is_not_an_ai May 13 '15 at 19:30
  • @Turing85: But did you rerun the test? – user2357112 May 13 '15 at 19:31
  • It would also be interesting to see if the results hold when calling `bubbleSortB()` first and then `bubbleSortA()`. With Java I often suspect memory allocation and gc cause unexpected results. Although getting the same results in C++ would suggest something more general is going on here. – Kevin Condon May 13 '15 at 19:34
  • @user2357112 yes. As I said: three different machines, same exact result. – Turing85 May 13 '15 at 19:37
  • You might also want to add a `System.gc()` prior to each bubble sort method call to help equalize the playing field between the tests. Also, allocate plenty of heap (`-Xms256m -Xmx256m`) and make sure the process isn't being swapped. – Kevin Condon May 13 '15 at 19:38
  • @KevinCondon Tested this as well. Changing the order around (first `sortB`, then `sortA`) did not change the results. – Turing85 May 13 '15 at 19:38
  • @tobi303 added the number of swaps. – Turing85 May 13 '15 at 19:45
  • @Turing85 One crude way to test whether this is related to branch misprediction without getting into a profiler is to just ensure that all of your elements are unique instead of completely random. For the C++ side, you can use std::sort and std::unique. –  May 13 '15 at 22:52

4 Answers

45

I think it may indeed be due to branch prediction. If you count the number of swaps compared to the number of inner sort iterations, you find:

Limit = 10

  • A = 560M swaps / 1250M loops
  • B = 1250M swaps / 1250M loops (0.02% fewer swaps than loops)

Limit = 50000

  • A = 627M swaps / 1250M loops
  • B = 850M swaps / 1250M loops

So in the Limit == 10 case the swap is performed 99.98% of the time in the B sort, which is obviously favourable for the branch predictor. In the Limit == 50000 case the swap is only hit randomly about 68% of the time, so the branch predictor is less beneficial.
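
One way to check this ratio yourself is to instrument the inner loop and count iterations and taken swap branches directly. The following is only a minimal sketch of such an instrumented run (the class name SwapRatio and the exact setup are mine, not part of the question's code); changing the condition from >= to > reproduces the sortA side of the numbers above.

import java.util.Random;

// Minimal sketch of an instrumented run: counts inner-loop iterations and swaps
// for the ">=" variant so the taken-branch ratio can be read off directly.
public class SwapRatio {
    public static void main(String[] args) {
        final int ARRAY_SIZE = 50000;
        final int LIMIT = 10; // try 50000 as well to see the ratio drop
        int[] a = new int[ARRAY_SIZE];
        Random r = new Random();
        for (int i = 0; i < ARRAY_SIZE; i++) {
            a[i] = r.nextInt(LIMIT);
        }

        long loops = 0;
        long swaps = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                ++loops;
                if (a[j] >= a[j + 1]) { // change to ">" for the sortA variant
                    int h = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = h;
                    ++swaps;
                }
            }
        }
        System.out.printf("swaps = %d, loops = %d, swap branch taken %.2f%% of the time%n",
                swaps, loops, 100.0 * swaps / loops);
    }
}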

uesp
  • Your argument seems sensible. Is there any way to test your hypothesis? – Turing85 May 13 '15 at 19:58
  • Quick answer is to control the input arrays to something such that the sorts for A/B do the same swaps in the same order (at least roughly). Exactly how to do this I don't know. You could also look at how random the swap order is "somehow" and see if that correlates to the sort times. – uesp May 13 '15 at 21:09
  • For cases where `LIMIT >= ARRAY_SIZE` you can do a test case where the array is composed of unique numbers. For example, in the case of `a[i] = ARRAY_SIZE - i` you get a swap on every loop and identical times for the A/B sorts. – uesp May 13 '15 at 22:22
  • @Turing85, note that my answer does in fact explain _why_ there is this difference in the number of swaps. – Petr May 14 '15 at 05:23
  • @Petr why there is a larger number of swaps was obvious to me. I just was not able to correlate this fact with branch misprediction. And the chosen answer gave (in my view) the best explanation with the best argumentation. – Turing85 May 14 '15 at 05:50
  • That the Java and C++ implementations behave similarly is also a strong indicator that branch prediction is at the root, since the chip is the same. – Rune FS May 19 '15 at 20:57
11

I think this can indeed be explained by branch misprediction.

Consider, for example, LIMIT=11, and sortB. On the first iteration of the outer loop it will very quickly stumble upon one of the elements equal to 10. So it will have a[j]=10, and since there are no elements greater than 10, a[j] will certainly be >= a[next]. Therefore it will perform a swap, then advance j by one step, only to find that a[j]=10 once again (the same swapped value). So once again a[j] >= a[next], and so on. Every comparison except a few at the very beginning will be true. The same happens on the following iterations of the outer loop.

Not so for sortA. It will start roughly the same way, stumble upon a[j]=10, and do some swaps in a similar manner, but only up to the point where it finds a[next]=10 too. Then the condition is false and no swap is done. And so on: every time it stumbles on a[next]=10, the condition is false and no swap is done. Therefore, this condition is true 10 times out of 11 (values of a[next] from 0 to 9) and false in 1 case out of 11. It is not strange that branch prediction fails.
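
To make this visible without a profiler, one can simply print the outcome of the comparison for a single pass of the outer loop. The sketch below is illustrative only (class and method names are mine, not from this answer): with >= the swap branch stays taken once the running maximum has been picked up, while with > it flips back every time another copy of the maximum is met.

import java.util.Random;

// Illustrative sketch: prints the swap-branch outcomes ('T' = taken, '.' = not
// taken) for the FIRST pass of the outer loop, for both comparison variants.
public class BranchPattern {
    public static void main(String[] args) {
        final int SIZE = 60;   // small, so the pattern fits on one line
        final int LIMIT = 11;
        int[] base = new int[SIZE];
        Random r = new Random(42); // fixed seed so both variants see the same data
        for (int i = 0; i < SIZE; i++) {
            base[i] = r.nextInt(LIMIT);
        }

        System.out.println("a[j] >  a[j+1]: " + firstPass(base.clone(), false));
        System.out.println("a[j] >= a[j+1]: " + firstPass(base.clone(), true));
    }

    // One pass of the inner loop, recording whether the swap branch was taken.
    static String firstPass(int[] a, boolean orEqual) {
        StringBuilder pattern = new StringBuilder();
        for (int j = 0; j < a.length - 1; ++j) {
            boolean take = orEqual ? a[j] >= a[j + 1] : a[j] > a[j + 1];
            pattern.append(take ? 'T' : '.');
            if (take) {
                int h = a[j];
                a[j] = a[j + 1];
                a[j + 1] = h;
            }
        }
        return pattern.toString();
    }
}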

Petr
9

Using the C++ code provided (time counting removed) with the perf stat command, I got results that confirm the branch-miss theory.

With Limit = 10, BubbleSortB benefits greatly from branch prediction (0.01% misses), but with Limit = 50000 branch prediction fails even more often (15.65% misses) than in BubbleSortA (12.69% and 12.76% misses, respectively).

BubbleSortA Limit=10:

Performance counter stats for './bubbleA.out':

   46670.947364 task-clock                #    0.998 CPUs utilized          
             73 context-switches          #    0.000 M/sec                  
             28 CPU-migrations            #    0.000 M/sec                  
            379 page-faults               #    0.000 M/sec                  
117,298,787,242 cycles                    #    2.513 GHz                    
117,471,719,598 instructions              #    1.00  insns per cycle        
 25,104,504,912 branches                  #  537.904 M/sec                  
  3,185,376,029 branch-misses             #   12.69% of all branches        

   46.779031563 seconds time elapsed

BubbleSortA Limit=50000:

Performance counter stats for './bubbleA.out':

   46023.785539 task-clock                #    0.998 CPUs utilized          
             59 context-switches          #    0.000 M/sec                  
              8 CPU-migrations            #    0.000 M/sec                  
            379 page-faults               #    0.000 M/sec                  
118,261,821,200 cycles                    #    2.570 GHz                    
119,230,362,230 instructions              #    1.01  insns per cycle        
 25,089,204,844 branches                  #  545.136 M/sec                  
  3,200,514,556 branch-misses             #   12.76% of all branches        

   46.126274884 seconds time elapsed

BubbleSortB Limit=10:

Performance counter stats for './bubbleB.out':

   26091.323705 task-clock                #    0.998 CPUs utilized          
             28 context-switches          #    0.000 M/sec                  
              2 CPU-migrations            #    0.000 M/sec                  
            379 page-faults               #    0.000 M/sec                  
 64,822,368,062 cycles                    #    2.484 GHz                    
137,780,774,165 instructions              #    2.13  insns per cycle        
 25,052,329,633 branches                  #  960.179 M/sec                  
      3,019,138 branch-misses             #    0.01% of all branches        

   26.149447493 seconds time elapsed

BubbleSortB Limit=50000:

Performance counter stats for './bubbleB.out':

   51644.210268 task-clock                #    0.983 CPUs utilized          
          2,138 context-switches          #    0.000 M/sec                  
             69 CPU-migrations            #    0.000 M/sec                  
            378 page-faults               #    0.000 M/sec                  
144,600,738,759 cycles                    #    2.800 GHz                    
124,273,104,207 instructions              #    0.86  insns per cycle        
 25,104,320,436 branches                  #  486.101 M/sec                  
  3,929,572,460 branch-misses             #   15.65% of all branches        

   52.511233236 seconds time elapsed
Morten Kristensen
fala
3

Edit 2: This answer is probably wrong in most cases. Below, where I say that everything above is fact, that is still true, but the lower portion is not true for most processor architectures; see the comments. However, I will say that it is still theoretically possible there is some JVM on some OS/architecture that does this, but that JVM is probably poorly implemented or it is a weird architecture. Also, this is theoretically possible in the sense that most conceivable things are theoretically possible, so I would take the last portion with a grain of salt.

First, I am not sure about the C++, but I can talk some about the Java.

Here is some code,

public class Example {

    public static boolean less(final int a, final int b) {
        return a < b;
    }

    public static boolean lessOrEqual(final int a, final int b) {
        return a <= b;
    }
}

Running javap -c on it I get the bytecode

public class Example {
  public Example();
    Code:
       0: aload_0
       1: invokespecial #8                  // Method java/lang/Object."<init>":()V
       4: return

  public static boolean less(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: if_icmpge     7
       5: iconst_1
       6: ireturn
       7: iconst_0
       8: ireturn

  public static boolean lessOrEqual(int, int);
    Code:
       0: iload_0
       1: iload_1
       2: if_icmpgt     7
       5: iconst_1
       6: ireturn
       7: iconst_0
       8: ireturn
}

You'll notice the only difference is if_icmpge (if compare greater/equal) versus if_icmpgt (if compare greater than).

Everything above is fact; the rest is my best guess as to how if_icmpge and if_icmpgt are handled, based on a college course I took on assembly language. To get a better answer you should look up how your JVM handles these. My guess is that C++ also compiles down to a similar operation.

Edit: Documentation on if_i<cond> is here

The way computers compare numbers is by subtracting one from the other and checking whether the result is 0 or not. So when doing a < b, it subtracts b from a and sees whether the result is less than 0 by checking the sign of the value (a - b < 0). To do a <= b, though, it has to do an additional step and subtract 1 (a - b - 1 < 0).

Normally this is a very minuscule difference, but this isn't just any code, this is freaking bubble sort! O(n^2) is the average number of times we are doing this particular comparison, because it is in the innermost loop.
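
For the arrays in the question, that comparison runs ARRAY_SIZE * (ARRAY_SIZE - 1) / 2 times, i.e. 50,000 * 49,999 / 2 ≈ 1.25 billion times, which matches the roughly 1250M inner-loop iterations counted in the accepted answer.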

Yes, it may have something to do with branch prediction; I am not sure, as I am not an expert on that, but I think this may also play a non-insignificant role.

Captain Man
  • I don't think you are correct about `<` being faster than `<=`. Processor instructions are discretized; each instruction must take an integer number of clock cycles, so there's no "time saved" unless you can squeeze a whole clock out of it. See http://stackoverflow.com/a/12135533 – kevinsa5 May 13 '15 at 22:24
  • Note that I'm talking about native code only. I suppose it's possible that a JVM implementation could perform that "optimization", but I'd guess that it would just use the native instructions rather than cook up its own solution. But that's just a guess. – kevinsa5 May 13 '15 at 22:26
  • On what do you base your assertion that <= uses an extra step to subtract an extra 1? At the x86 level, for example, a `cmp` followed by a `jl` will take exactly the same amount of time, successful branch prediction permitting, as a `cmp` followed by a `jle`. http://stackoverflow.com/questions/12135518/is-faster-than has more details. – ClickRick May 14 '15 at 00:46
  • @ClickRick The assembly I learned was for SPARC which used a reduced instruction set. Maybe it didn't have `jle`? Or maybe I heard this false assumption somewhere too. Not 100% sure where I got this now that I really consider it. I suppose theoretically though that how any particular OS's/Architecture's JVM interpreted it could make some difference, but I would now assume they all do this in a single cycle. – Captain Man May 14 '15 at 03:53
  • @CaptainMan According to http://www.cs.northwestern.edu/~agupta/_projects/sparc_simulator/SPARC%20INSTRUCTION%20SET-1.htm the SPARC supports both `bl` and `ble` instructions, which is entirely unsurprising to me. – ClickRick May 14 '15 at 08:24
  • @Click I am utterly at a loss how I came to this then. Maybe I heard something like `for (int i = 0; i <= thing.size() - 1; i++)` had to do an extra operation when I was learning and over the years associated it with `<=` (the `- 1` doesn't help haha). I'll add an edit that my answer is wrong unless the JVM for some particular OS/Architecture is implemented oddly. – Captain Man May 14 '15 at 14:11