Why is branch prediction faster than no branch at all?

Question

Inspired by this question: Why is it faster to process a sorted array than an unsorted array?

I wrote my own branch prediction experiment:

public class BranchPrediction {
    public static void main(final String[] args) {
        long start;
        long sum = 0;

        /* No branch */
        start = System.nanoTime();
        sum = 0;
        for (long i = 0; i < 10000000000L; ++i)
            sum += i;
        System.out.println(System.nanoTime() - start);
        System.out.println(sum);

        /* With branch */
        start = System.nanoTime();
        sum = 0;
        for (long i = 0; i < 10000000000L; ++i)
            if (i >= 0)
                sum += i;
        System.out.println(System.nanoTime() - start);
        System.out.println(sum);

        /* No branch (again) */
        start = System.nanoTime();
        sum = 0;
        for (long i = 0; i < 10000000000L; ++i)
            sum += i;
        System.out.println(System.nanoTime() - start);
        System.out.println(sum);

        /* With branch (again) */
        start = System.nanoTime();
        sum = 0;
        for (long i = 0; i < 10000000000L; ++i)
            if (i >= 0)
                sum += i;
        System.out.println(System.nanoTime() - start);
        System.out.println(sum);
    }
}

The result confuses me: according to program output, the loop with a branch is reliably faster than no branch loops.

Example output:

7949691477
-5340232226128654848
6947699555
-5340232226128654848
7920972795
-5340232226128654848
7055459799
-5340232226128654848

Why is it so?

Edit:

Disassembled class shows Java compiler did not optimize (miss) anything (https://gist.github.com/HouzuoGuo/5692424)
The Java benchmark technique used by author of Why is it faster to process a sorted array than an unsorted array? is the same as mine.
The machine is an Intel core i7, running Linux 3.2 64-bit and Oracle JVM 1.7 64-bit
When I supersize the number of loop iterations, with-branch loop runs multi-SECONDS faster than non-branch loop.

Are you sure you aren't seeing the effects of compiler/JIT optimizations? — Oliver Charlesworth, Jun 02 '13 at 01:59
I get very different numbers: 897432717 1236491513 884481456 3572587417 — Ted Hopp, Jun 02 '13 at 02:01
Java 1.7.0_21 64-bit hotspot VM on Linux 3.2 64-bit Intel core i7 — , Jun 02 '13 at 02:02
Obligatory: [How do I write a correct micro-benchmark in Java?](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) — Mark Peters, Jun 02 '13 at 02:03
Java 1.7.0_11 Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) — Ted Hopp, Jun 02 '13 at 02:03
So is there a problem in the Java code? Original author of "Why is processing a sorted array faster than an unsorted array?" used the same approach to benchmark his/her program. — , Jun 02 '13 at 02:05
You must understand that Java doesn't work like C++ to begin with. Start reading the link posted by @MarkPeters and then redo your benchmark. — Luiggi Mendoza, Jun 02 '13 at 02:07
I appreciate the suggestion Luiggi, however the author of "Why is processing a sorted array faster than an unsorted array?" wrote the experiment in both C++ and Java. If you see an obvious error in my benchmarking technique, please suggest. — , Jun 02 '13 at 02:08
I supersize the experiment and the result reliably shows the for loop with a branch is multiseconds faster than the loop without branch. — , Jun 02 '13 at 02:12
If you read the answer from your posted link, the author says it has done the benchmark with C++ and Java, but **never says how was done**. — Luiggi Mendoza, Jun 02 '13 at 02:15
Try printing out the `sum` variable between each test. As it is right now, the JIT could legally optimize everything out. — Mysticial, Jun 02 '13 at 02:16
Thanks Mystical. I have modified the program to reset `sum` before each loop and print out the result. — , Jun 02 '13 at 02:21
Perhaps [HotSpot's generated code](http://stackoverflow.com/questions/1551781/how-can-i-see-the-code-that-hotspot-generates-after-optimizing) will yield some insight. — Marcelo Cantos, Jun 02 '13 at 02:23
I think at this point, you'll need to do an assembly dump. Since AFAICT, you're doing the benchmark correctly. — Mysticial, Jun 02 '13 at 02:23
Thank you Luiggi and Mark. However, I don't think JVM is so unpredictable, if multi secconds difference is not enough to prove that one loop is faster than the another, what can? — , Jun 02 '13 at 02:33
You are very likely seeing hotspot's influence on your values. Consider the fact that each iteration of your 'loops' is being completed in under a nanosecond, it's pretty obvious you are seeing the jvm optimize your loops. For example, it would be very trivial for the If statement to be removed since it could easily tell that 'i' will always be greater than or equal to 0. Another optimization, would be to instead of incrementally add each value per loop it might do multiple additions and skip a few iterations as you aren't forcing anything to be done with the intermittent values. — Jyro117, Jun 02 '13 at 03:14
I'd suggest writing these tests into something like JUnit, where your test is not so much effected by `compiler/JIT optimizations` (for better accuracy). — classicjonesynz, Jun 02 '13 at 03:43
Your benchmark methodology is somewhat questionable: Diagnostic output of the HotSpot JIT compiler reveals the method recompiled several times during the timing phase (this is called [deoptimization](http://en.wikipedia.org/wiki/Adaptive_optimization#Deoptimization)). You should not be taking a new code path during measurement, as this will cause you to measure a mixture of unoptimized code, optimization, and optimized code. While this is unlikely to account for a difference of a full second for such simple code, you might keep than in mind for the next benchmark (or use a tested library). — meriton, Jun 02 '13 at 08:28

score 4 · Accepted Answer · 2013-12-03T19:40:20.787

4

After running the same experiment on my other machines (Intel servers and workstations), I may conclude that the phenomenon I experienced is specific to this laptop CPU (Intel i7 Q740M).

==== 6 months later edit ====

Check this out: http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/

edited Dec 03 '13 at 19:40

answered Jun 02 '13 at 09:56

1

What did you achieve on other machines? – Dejan Jun 04 '13 at 10:53
My other machines run Intel Xeon and Intel Core 2, both running Linux 64-bit and Oracle JVM. The Xeon CPU shows no real difference between the loops and the Core 2 CPU shows that branched loop runs noticeably slower than no-branch loop. – Jun 04 '13 at 11:10
This i7 CPU must be particularly good at adding a positive number into a negative number. – Jun 04 '13 at 11:13
1

Idk i have Intel i3, I made programs such as yours up there I didn't achieve much with optimizing with if statements and exclusions. – Dejan Jun 04 '13 at 21:00

score 2 · Answer 2 · answered Jun 02 '13 at 10:03

Have in mind that JVM is optimizing execution internally, and there are caches inside your PC that make computing faster. Since you have so powerful processor (many independant cores) it is not strange. Also note that there is code that runs under the Java code which maps to machine code of your PC. Just type code as optimized as you can, let JVM worry about it.

EDIT: Machines and hardware like big load, they operate with more efficiency then. Especially caches.

Why is branch prediction faster than no branch at all?

2 Answers2

Linked