19

When I'm writing a tight loop that needs to be fast, I am often bothered by thoughts about how the processor's branch prediction is going to behave. For instance, I try my best to avoid having an if statement in the innermost loop, especially one with a result that is not somewhat uniform (say, one that evaluates to true or false at random).

I tend to do that because of the somewhat common knowledge that the processor prefetches instructions, and if it turns out that it mispredicted a branch, then the prefetch is useless.

My question is: is this really an issue with modern processors? How good can branch prediction be expected to be?
What coding patterns can be used to make it better?

(For the sake of the discussion, assume that I am beyond the "early-optimization is the root of all evil" phase)

shoosh
  • 76,898
  • 55
  • 205
  • 325
  • I had an interesting surprise when I finally had a chance to test this for myself: http://codereview.stackexchange.com/questions/6502/fastest-way-to-clamp-an-integer-to-the-range-0-255 – Mark Ransom Jan 12 '12 at 16:24
  • See http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array for a very concrete example – David d C e Freitas Aug 14 '12 at 21:32

7 Answers

28

Branch prediction is pretty darned good these days. But that doesn't mean the penalty of branches can be eliminated.

In typical code, you probably get well over 99% correct predictions, and yet the performance hit can still be significant. There are several factors at play in this.

One is the simple branch latency. On a common PC CPU, that might be on the order of 12 cycles for a mispredict, or 1 cycle for a correctly predicted branch. For the sake of argument, let's assume that all your branches are correctly predicted; then you're home free, right? Not quite.

The simple existence of a branch inhibits a lot of optimizations. The compiler is unable to reorder code efficiently across branches. Within a basic block (that is, a block of code that is executed sequentially, with no branches, one entry point and one exit), it can reorder instructions as it likes, as long as the meaning of the code is preserved, because they'll all be executed sooner or later. Across branches, it gets trickier. We could move these instructions down to execute after this branch, but then how do we guarantee they get executed? Put them in both branches? That's extra code size, that's messy too, and it doesn't scale if we want to reorder across more than one branch.

Branches can still be expensive, even with the best branch prediction. Not just because of mispredicts, but because instruction scheduling becomes so much harder.

This also implies that rather than the number of branches, the important factor is how much code goes in the block between them. A branch on every other line is bad, but if you can get a dozen lines into a block between branches, it's probably possible to get those instructions scheduled reasonably well, so the branch won't restrict the CPU or compiler too much.

But in typical code, branches are essentially free, because performance-critical code usually doesn't have that many branches clustered closely together.
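To make the point concrete, here is a hypothetical illustration (function names are mine, not from the answer) of trading a data-dependent branch for straight-line code that the compiler and CPU can schedule freely:

```cpp
#include <cstdint>

// Branchy version: the compiler must preserve the control-flow edge,
// which limits how far it can reorder the surrounding instructions,
// and the branch is unpredictable on random data.
int64_t sum_if_branchy(const int* data, int n, int threshold) {
    int64_t sum = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

// Branchless version: the condition becomes a data dependency
// (typically compiled to a conditional move or a mask),
// so there is nothing to mispredict.
int64_t sum_if_branchless(const int* data, int n, int threshold) {
    int64_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (data[i] >= threshold) ? data[i] : 0;
    }
    return sum;
}
```

Whether the branchless form is actually faster depends on the predictability of the data and the target; as always, measure.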

jalf
  • 243,077
  • 51
  • 345
  • 550
  • 2
    The part about CPUs is wrong. The CPU doesn't care if an instruction is after a branch or not. It simply fetches and executes the path that is predicted to be taken, exactly like the "sure" part before the branch. The only difference is that instructions after a predicted branch are marked as speculative (so that the CPU can flush them if the prediction turns out to be wrong), and are held after execution before the retirement stage, until the prediction is confirmed to be correct. So the bottom line is, branches which are correctly predicted are in most cases essentially free. – slacker Apr 20 '10 at 23:26
  • @slacker: true, removed that part – jalf Jul 21 '12 at 09:58
  • 12 vs 1 cycle -> citation needed – Iouri Goussev Nov 15 '12 at 23:27
  • 3
    @IouriGoussev Check [this book](http://www.amazon.com/Computer-Architecture-Quantitative-Approach-Edition/dp/0123704901), or Intels or AMDs optimizations manuals (available for free in pdf form). I'm willing to bet it's mentioned [here](http://www.agner.org/optimize/) as well. And of course, any CS class on computer architecture will mention it. Although, on some architectures, it's more like 25 vs 1. – jalf Nov 16 '12 at 08:12
  • 1
    My [survey paper](https://arxiv.org/pdf/1804.00261.pdf) on branch predictors shows that branch misprediction takes between 14 to 25 cycles. – user984260 Apr 03 '18 at 10:01
  • 2
    @slacker: taken branches are still an obstacle for superscalar CPUs that can fetch/decode/execute multiple instructions per clock. (e.g. up to 4 per clock in Intel Core 2 and later). Instructions in a fetch block after a taken branch are useless. The early parts of the front-end that deal with contiguous blocks of instructions from L1i cache typically lose some throughput on taken branches. Normally these stages have more throughput than the narrowest part of the pipeline (often issue/rename into the OoO backend) so queues between stages can absorb bubbles, though. – Peter Cordes Mar 31 '19 at 23:55
  • 1
    But in general, laying out code so the common fast path through a small function contains no taken branches is a win. (For I-cache footprint as well as fetch/decode. And also, Intel Haswell and later can execute predicted-not-taken branches on either of 2 execution ports, but predicted-taken branches can only execute on port 6. So the correctly-predicted branch throughput is 2 per clock, as long as at least one of them is not taken. uop-cache lines are also ended by unconditionally-taken branches. See Agner Fog's microarch PDF (https://agner.org/optimize/) – Peter Cordes Mar 31 '19 at 23:58
  • 1
    See also [Branch prediction overhead of perfectly predicted branch](//stackoverflow.com/q/49312285) for examples of cases where it is basically free. – Peter Cordes Apr 01 '19 at 00:01
  • 1
    Also, @slacker: out-of-order CPUs treat *all* instructions as speculative until retirement. Being in the shadow of a conditional branch isn't special. Any load or store can fault, and a hardware interrupt can arrive at any time. (To minimize interrupt latency, you typically want to discard in-flight instructions and take the retirement state as the architectural state for handling the interrupt.) Modern CPUs do take special snapshots of the out-of-order state on conditional / indirect branches so they can do fast recovery to the mispredict when it's first discovered, though. – Peter Cordes Apr 01 '19 at 00:07
  • 1
    See also [Characterizing the Branch Misprediction Penalty](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.8373&rep=rep1&type=pdf), a nice paper that models the real effect on a superscalar Out-of-Order pipeline, of branch misses, data cache misses, and I-cache misses. And how long it takes the CPU to return to a steady-state of lots of instructions in flight hiding execution latencies. – Peter Cordes Apr 01 '19 at 00:22
  • Further update: density of taken branches can be a problem for branch prediction in modern x86 CPUs (at least Intel), other than a loop-branch for a tiny loop. More than 1 taken branch per 16 bytes of code (IIRC) can result in slowdowns to 1 taken branch per 2 clocks instead of 1/clock. @Noah and I discussed this in a comment thread somewhere, but I can't find it now. – Peter Cordes Dec 11 '21 at 16:46
4

"(For the sake of the discussion, assume that I am beyond the "early-optimization is the root of all evil" phase)"

Excellent. Then you can profile your application's performance, use gcc's branch-prediction hints to mark a branch one way and profile again, then mark the opposite prediction and profile again.
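A minimal sketch of those hints, assuming the answer means gcc's `__builtin_expect` (often wrapped in `likely()`/`unlikely()` macros; the function here is hypothetical):

```cpp
// GCC/Clang-specific hint: tells the compiler which way the branch
// usually goes, so it can lay out the hot path as fall-through code.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value) {
    if (unlikely(value < 0)) {
        return -1;      // error path, hinted as rare
    }
    return value * 2;   // hot path, laid out fall-through
}
```

Flipping `unlikely` to `likely` and re-profiling, as the answer suggests, shows whether the hint matters on your workload.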

Now imagine theoretically a CPU that prefetches both branch paths. And for subsequent if statements in both paths, it will prefetch four paths, etc. The CPU doesn't magically grow four times the cache space, so it's going to prefetch a shorter portion of each path than it would do for a single path.

If you find half of your prefetches being wasted, losing say 5% of your CPU time, then you do want to look for a solution that doesn't branch.

Windows programmer
  • 7,871
  • 1
  • 22
  • 23
3

If we're beyond the "early optimization" phase, then surely we're beyond the "I can measure that" phase as well? With the crazy complexities of modern CPU architecture, the only way to know for sure is to try it and measure. Surely there can't be that many circumstances where you will have a choice of two ways to implement something, one of which requires a branch and one which doesn't.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
2

Yes, branch prediction really can be a performance issue.

This question (currently the highest-voted question on StackOverflow) gives an example.

Community
  • 1
  • 1
user541686
  • 205,094
  • 128
  • 528
  • 886
2

Not exactly an answer, but you can find here an applet that demonstrates the finite state machine often used for table-based branch prediction in modern microprocessors.

It illustrates the use of extra logic to generate a fast (but possibly wrong) estimate of the branch condition and target address.
The processor fetches and executes the predicted instructions at full speed, but needs to revert all intermediate results when the prediction turns out to have been wrong.
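In case the applet link rots, here is a minimal sketch of the classic 2-bit saturating counter that such table-based predictors commonly use per table entry (the initial state is an arbitrary choice of mine):

```cpp
// States 0..1 predict "not taken", states 2..3 predict "taken".
// Two mispredictions in a row are needed to flip the prediction,
// which keeps a loop-exit branch from perturbing the next run.
struct TwoBitPredictor {
    int state = 2;  // start weakly taken

    bool predict() const { return state >= 2; }

    void update(bool taken) {
        if (taken) { if (state < 3) ++state; }
        else       { if (state > 0) --state; }
    }
};
```

A real predictor keeps a table of these counters indexed by (a hash of) the branch address, often combined with branch history.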

VonC
  • 1,262,500
  • 529
  • 4,410
  • 5,250
1

My answer is:

The reason AMD has been as fast as or faster than Intel at some points in the past is simply that they had better branch prediction.

If your code has no branches, then there is nothing to mispredict, and it can be expected to run faster.

So, in conclusion: avoid branches if they're not necessary. If they are, try to arrange things so that one side of the branch is taken, say, 95% of the time.
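One pattern that follows this advice (a hypothetical sketch, not from the answer): hoist a loop-invariant condition out of the inner loop, so there is no branch left to predict at all:

```cpp
// Instead of testing `negate` on every iteration of the inner loop,
// resolve it once up front; the loop body becomes straight-line code.
void scale(float* data, int n, bool negate) {
    float factor = negate ? -1.0f : 1.0f;
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}
```

Good compilers often perform this transformation (loop unswitching) themselves, but writing it explicitly makes the intent clear.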

Claudiu
  • 224,032
  • 165
  • 485
  • 680
0

One thing I recently found (on a TI DSP) is that trying to avoid branches can sometimes generate more code than the branch prediction cost.

I had something like the following in a tight loop:

if (var >= limit) { otherVar = 0;}

I wanted to get rid of the potential branch, and tried changing it to:

otherVar *= (var<limit)&1;

But the 'optimization' generated twice as much assembly and was actually slower.
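For reference, the two variants as compilable functions (wrappers and names are mine); which one wins is entirely target-dependent, as the answer found on the TI DSP:

```cpp
// Branchy original: a single compare and a conditionally executed store.
void clamp_branchy(int var, int limit, int* otherVar) {
    if (var >= limit) { *otherVar = 0; }
}

// Branchless rewrite: relies on (var < limit) evaluating to exactly
// 0 or 1, turning the condition into a multiply by 0 or 1.
void clamp_branchless(int var, int limit, int* otherVar) {
    *otherVar *= (var < limit) & 1;
}
```

On architectures with predicated instructions (like many DSPs), the "branch" may compile to a conditional store with no prediction involved, which is why the arithmetic rewrite lost.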

AShelly
  • 34,686
  • 15
  • 91
  • 152