Mispredicted branches stall only the front-end, not the entire pipeline, so the overall performance cost depends on the code. If execution is bottlenecked purely on the front-end, losing 15 to 19 cycles of front-end throughput costs about that many cycles of total time, but many programs can partly hide the bubble because the back-end still has independent work in flight to chew on.
It's something you can microbenchmark, but constructing such a benchmark is somewhat tricky. https://www.7-cpu.com/ has numbers for many CPUs, e.g.
- Cortex A76: reported as a 14-cycle penalty.
- Skylake: 16.5 cycles on average with a uop-cache hit, or 19-20 cycles on a uop-cache miss. The uop cache effectively shortens the pipeline: fewer stages between the re-steer and having uops ready to issue from the front-end into the back-end.
- Cortex A53: 7 cycles. Much shorter recovery time, as expected for a simpler in-order pipeline.
I suspect those numbers are from vendor manuals, unless 7-cpu has a standard benchmark they use.
Also yes, Agner Fog attempted to microbenchmark this for many x86 CPUs, but hard numbers are difficult to obtain; he reports that measurements were quite noisy on some CPUs. For example, for Haswell/Broadwell he writes in his microarch PDF:
> There may be a difference in branch misprediction penalty between the three sources of µops, but I have not been able to verify such a difference because the variance in the measurements is high. The measured misprediction penalty varies between 16 and 20 clock cycles in all three cases.