
I understand the explanation there: these macros give the CPU hints for static branch prediction.

I was wondering how relevant these are on current Intel CPUs, given that Intel has dropped support for static prediction hints, as mentioned here. If I understand how it works now, the only thing the compiler can control is the number of branch instructions in each path and which path falls through; which branch path is predicted, fetched and decoded is decided at runtime.

Given this, are there any scenarios where branch hints in code are still useful for software targeting recent Intel processors, for example by enabling a conditional return, or by reducing the number of branch instructions on the critical path of nested if/else statements?

Also, if these are still relevant, any specifics on GCC and other popular compilers would be appreciated.

P.S. I am not advocating premature optimization or peppering code with these macros, but I am interested in the topic because I am working with some time-critical code, and I still like to reduce code clutter where possible.

Thanks

Aelian
    Generating code so that the expected path is together in memory still improves code locality, and the compiler can control that. – Jester May 15 '15 at 18:53
  • @Jester Thanks. Agree it could improve the instruction cache performance. Wonder if that is done by gcc now when targeting a specific processor. – Aelian May 15 '15 at 19:44
  • Also, AFAIK, methods don't get split during compilation / linking. So for if/else's in small methods / control blocks the locality improvement may not help much. – Aelian Jun 11 '15 at 13:02
  • In short, http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html argues that they make sense if your prediction is right >99.99% (the example uses 1 in 10000), subject of course to compiler, CPU, etc. – Sergii Zaskaleta Nov 12 '15 at 10:04
  • After reading more and looking into the disassembly of the code provided in the above blog (on a sandybridge box and compiling with -O3 -march=native) I can say that: 1. There are no special instructions / hints included in machine code. Prediction is by hardware branch predictor. 2. When the predictor does not have any history for the address, the forward jumps are predicted as not taken and backward jumps are predicted as taken. 3. Compiler produced different code for the example in such a way that code would benefit from the behaviour described in '2'. – Aelian Feb 23 '17 at 02:06

1 Answer


As in the comments section of your question, you correctly figured out that:

  1. There are no static branch prediction hints in the opcode map of modern Intel x86 CPUs anymore;
  2. Dynamic branch prediction for "cold" conditional jumps tends to predict the fall-through path;
  3. The compiler can use __builtin_expect to choose which path of an if-then-else construct is placed as the fall-through case in the generated assembly.

Now, consider a code base being compiled for multiple target architectures, not just Intel x86. Many of them have static branch hints, dynamic branch predictors of varying complexity, or both.

As an example, the Intel Itanium architecture offers an extensive system of prediction hints for all types of instructions: control flow, load/store, etc. Itanium was designed to have its code extensively optimized by the compiler, using these statically assigned instruction slots in a bundle together with the hints.

Therefore, __builtin_expect is still relevant for the (rare) cases where 1) correct branch prediction information is too hard for a compiler to deduce automatically, and 2) the underlying hardware on at least one target architecture is known to be unable to predict the branch reliably at runtime. Given that certain low-power processors include primitive branch predictors that do not track branch history and always choose the fall-through path, the hints start to look beneficial there. For modern Intel x86 hardware, not so much.

Grigory Rechistov
    Not-taken branches are still cheaper than taken, even if both predict correctly: no chance of any front-end bubbles when predicted correctly, and they can run on more execution ports (Intel Haswell). And keeping all the hot code together is better for L1I / uop-cache locality / density. Compilers might also make other decisions based on what's likely or not (e.g. choose not to bloat the code auto-vectorizing a loop that's unlikely to run, maybe). Or maybe choose to use a branch instead of `cmov` if a condition is unlikely. – Peter Cordes Feb 10 '18 at 02:31
  • Correct me if I'm wrong, but isn't the prediction for cold branches "fall-through for forward jumps and not fall-through for backward jumps"? – Yakov Galka Jan 21 '21 at 04:00