Branch-aware programming

Question

I'm reading around that branch misprediction can be a hot bottleneck for the performance of an application. As I can see, people often show assembly code that unveil the problem and state that programmers usually can predict where a branch could go the most of the times and avoid branch mispredictons.

My questions are:

Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?
What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

Code examples and benchmarks are welcome.

Related: [Why is processing a sorted array faster than an unsorted array?](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array). Look at its currently [highest voted answer](http://stackoverflow.com/a/11227902/514235). — iammilind, Sep 15 '15 at 08:52
Since branch prediction only happens at the machine level, it doesn't really make sense to ask for it at a high-level programming language level. Compilers usually contain vendor-specific mechanisms to annotate a conditional with an expected result, but it's still up to the compiler to generate what it thinks is the best machine code (and this may be modified e.g. by profile-guided optimizations or space constraints). Ultimately, you need to know the machine if you care about the details of the machine, and you need to understand your profiling tools. — Kerrek SB, Sep 15 '15 at 08:53
You should trust your *optimizing* compiler on that. [GCC](http://gcc.gnu.org/) gives you [`__builtin_expect`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) — Basile Starynkevitch, Sep 15 '15 at 08:53
Keep lists sorted can help as this will allow code such as 'if (x < 10)` to stick to one path longer — rhughes, Sep 15 '15 at 08:53
It's compiler-specific. GCC, for example, has [_builtin_expect](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fexpect-4089). — David Schwartz, Sep 15 '15 at 08:53
http://stackoverflow.com/questions/3702903/portable-branch-prediction-hints — Tom, Sep 15 '15 at 08:53
@iammilind Yes, I'd read that thread. It's very interesting and it gives some hints of what I'm looking for, like *loop-interchange*, *data sorting*, ... — Paolo M, Sep 15 '15 at 08:57
There're things like "noreturn" function attributes to hint that the code path is unlikely and the compiler should optimize the code to follow the alternative path. — sharptooth, Sep 15 '15 at 09:00
The standard way AFAIK is to do profile guided optimization ( https://en.wikipedia.org/wiki/Profile-guided_optimization ). For example, GCC/ICC support it and I expect all other major compilers to do so. The way it works is simple and elegant: (a) compile program using annotations to keep track of the commonly executed code portions; (b) run sample workload; (c) re-compile the program using the statistics gathered to minimize branch mispredictions. — Radu, Sep 15 '15 at 09:01
@KerrekSB You have been very clear, thanks. You would say that answer to question n.1 is **NO**, right? But what about question n.2? I mean, what can I do as a programmer when designing and implementing an application to write *branch-friendly* code? In my previous comment I explain what I'm looking for. — Paolo M, Sep 15 '15 at 09:03
@rhughes Thanks, I'm looking for *guidelines* like this one. — Paolo M, Sep 15 '15 at 09:07
@PaoloM No problem. Just remember that the time it takes to actually sort the list may outweigh the benefits of you are looking for. — rhughes, Sep 15 '15 at 09:08
@PaoloM: To write branch-friendly code, you need to learn how to use profiling tools, and to do that you need to understand their output, and I expect that you'll need to be able to read machine code at some point. — Kerrek SB, Sep 15 '15 at 09:21
You should also read [Is there a compiler hint for GCC to force branch prediction to always go a certain way?](http://stackoverflow.com/q/30130930/1708801) and all the caveats that are laid out in the answers there. — Shafik Yaghmour, Sep 15 '15 at 09:25
It's very important to keep the "big picture" in view. First, *profile the code and find out which parts are worth optimising*. The most extreme real-world example I have worked on was a 250,000-line program where more than 90% of the computation was done in one loop that was just *3 lines of code*. There was no way to eliminate the work done in that loop. Optimizing *anything* in the rest of the program would have been a total waste of effort. — alephzero, Sep 15 '15 at 11:21
As for Q2 ("_what should I keep in mind_...") almost nothing _until_ (following @alephzero's comment and Knuth's "_beware premature optimisation_") you know for certain that a piece of code is causing (disproportionate) problems. — TripeHound, Sep 15 '15 at 11:34
In some situations you can rewrite code with branches to be branch-free which can be a win, depending on your situation. There's some good articles about this at [Cell Performance](http://cellperformance.beyond3d.com/articles/2006/04/benefits-to-branch-elimination.html) — mattnewport, Sep 15 '15 at 21:17

score 30 · Accepted Answer · edited May 23 '17 at 11:54

people often ... and state that programmers usually can predict where a branch could go

(*) Experienced programmers often remind that human programmers are very bad at predicting that.

1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Not in standard c++ or c. At least not for a single branch. What you can do is minimize the depth of your dependency chains so that branch mis-prediction would not have any effect. Modern cpus will execute both code paths of a branch and drop the one that wasn't chosen. There's a limit to this however, which is why branch prediction only matters in deep dependency chains.

Some compilers provide extension for suggesting the prediction manually such as __builtin_expect in gcc. Here is a stackoverflow question about it. Even better, some compilers (such as gcc) support profiling the code and automatically detect the optimal predictions. It's smart to use profiling rather than manual work because of (*).

2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

Primarily, you should keep in mind that branch mis-prediction is only going to affect you in the most performance critical part of your program and not to worry about it until you've measured and found a problem.

But what can I do when some profiler (valgrind, VTune, ...) tells that on line n of foo.cpp I got a branch prediction penalty?

Lundin gave very sensible advice

Measure fo find out whether it matters.
If it matters, then
- Minimize the depth of dependency chains of your calculations. How to do that can be quite complicated and beyond my expertise and there's not much you can do without diving into assembly. What you can do in a high level language is to minimize the number of conditional checks (**). Otherwise you're at the mercy of compiler optimization. Avoiding deep dependency chains also allows more efficient use of out-of-order superscalar processors.
- Make your branches consistently predictable. The effect of that can be seen in this stackoverflow question. In the question, there is a loop over an array. The loop contains a branch. The branch depends on size of the current element. When the data was sorted, the loop could be demonstrated to be much faster when compiled with a particular compiler and run on a particular cpu. Of course, keeping all your data sorted will also cost cpu time, possibly more than the branch mis-predictions do, so, measure.
If it's still a problem, use profile guided optimization (if available).

Order of 2. and 3. may be switched. Optimizing your code by hand is a lot of work. On the other hand, gathering the profiling data can be difficult for some programs as well.

(**) One way to do that is transform your loops by for example unrolling them. You can also let the optimizer do it automatically. You must measure though, because unrolling will affect the way you interact with the cache and may well end up being a pessimization.

I consider question 1 as answered, thanks. But what can I do when some profiler (*valgrind*, *VTune*, ...) tells that on line n of foo.cpp I got a branch prediction penalty? — Paolo M, Sep 15 '15 at 09:05
@PaoloM You should look at that code and see if that penalty at all matters to program performance. Most likely it does not. In the rare case where it does, you would simply try to rewrite the code so that it contains as few conditional checks as possible. — Lundin, Sep 15 '15 at 09:13
Even gcc notes on `__builtin_expect` which I [quote here](http://stackoverflow.com/q/30130930/1708801) say *you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform* — Shafik Yaghmour, Sep 15 '15 at 09:24
"transform your loops by for example unrolling them" -- I'm pretty sure the compiler will do that for you... — John Dvorak, Sep 15 '15 at 10:36
@JanDvorak Yes, if you ask it to do that with appropriate optimization flags. There are cases however, where letting compiler unroll all your loops (at the optimizer's discretion) is undesirable in which case you'll have to unroll manually the loops for which it *is* desirable. — eerorika, Sep 15 '15 at 10:47
Another interesting rule of thumb to prevent branches can be found on the [cachegrind user manual](http://valgrind.org/docs/manual/cg-manual.html#cg-manual.acting-on): *Looking at the Bcm and Bim misses can also be helpful. In particular, Bim misses are often caused by switch statements, and in some cases these switch statements can be replaced with table-driven code.* — Paolo M, Sep 16 '15 at 07:04
@PaoloM indeed, getting rid of your conditionals (the `switch`) will get you rid of the branches - and therefore branch prediction misses - altogether. — eerorika, Sep 16 '15 at 07:13
You have (*) in the answer, but I can't find the relation. And I would like to upvote, buts this keeps me from doing so. — gsamaras, Oct 14 '15 at 12:27
@gsamaras The first (*) is the referenced paragraph and it's later referred to. I had confusingly put the fist one at the end of the paragraph and moved it now to the front. I hope it clears that up. — eerorika, Oct 14 '15 at 12:46
Most of this is wrong. CPUs don't speculate both paths. Dependency chains are almost irrelevant to misprediction cost. Speculation is like ~200 instructions ahead nowadays. Branch prediction is one of the major players behind tractable IPC (memory latency being by far the leading one). Plus the whole "only experts can handle optimizing things" schtick continues to irk me. — Veedrac, Aug 08 '18 at 10:44

score 22 · Answer 2 · 2015-12-14T15:35:40.863

As a caveat, I'm not a micro-optimization wizard. I don't know exactly how the hardware branch predictor works. To me it's a magical beast against which I play scissors-paper-stone and it seems to be able to read my mind and beat me all the time. I'm a design & architecture type.

Nevertheless, since this question was about a high-level mindset, I might be able to contribute some tips.

Profiling

As said, I'm not a computer architecture wizard, but I do know how to profile code with VTune and measure things like branch mispredictions and cache misses and do it all the time being in a performance-critical field. That's the very first thing you should be looking into if you don't know how to do this (profiling). Most of these micro-level hotspots are best discovered in hindsight with a profiler in hand.

Branch Elimination

A lot of people are giving some excellent low-level advice on how to improve the predictability of your branches. You can even manually try to aid the branch predictor in some cases and also optimize for static branch prediction (writing if statements to check for the common cases first, e.g.). There's a comprehensive article on the nitty-gritty details here from Intel: https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts.

However, doing this beyond a basic common case/rare case anticipation is very hard to do and it is almost always best saved for later after you measure. It's just too difficult for humans to be able to accurately predict the nature of the branch predictor. It's far more difficult to predict than things like page faults and cache misses, and even those are almost impossible to perfectly humanly-predict in a complex codebase.

However, there is an easier, high-level way to mitigate branch misprediction, and that's to avoid branching completely.

Skipping Small/Rare Work

One of the mistakes I commonly made earlier in my career and see a lot of peers trying to do when they're starting out, before they've learned to profile and are still going by hunches, is to try to skip small or rare work.

An example of this is memoizing to a large look-up table to avoid repeatedly doing some relatively-cheap computations, like using a look-up table that spans megabytes to avoid repeatedly calling cos and sin. To a human brain, this seems like it's saving work to compute it once and store it, except often loading the memory from this giant LUT down through the memory hierarchy and into a register often ends up being even more expensive than the computations they were intended to save.

Another case is adding a bunch of little branches to avoid small computations which are harmless to do unnecessarily (won't impact correctness) throughout the code as a naive attempt at optimization, only to find the branching costs more than just doing unnecessary computations.

This naive attempt at branching as an optimization can also apply even for slightly-expensive but rare work. Take this C++ example:

struct Foo
{
    ...
    Foo& operator=(const Foo& other)
    {
        // Avoid unnecessary self-assignment.
        if (this != &other)
        {
            ...
        }
        return *this;
    }
    ...
};

_{Note that this is somewhat of a simplistic/illustrative example as most people implement copy assignment using copy-and-swap against a parameter passed by value and avoid branching anyway no matter what.}

In this case, we're branching to avoid self-assignment. Yet if self-assignment is only doing redundant work and doesn't hinder the correctness of the result, it can often give you a boost in real-world performance to simply allow the self-copying:

struct Foo
{
    ...
    Foo& operator=(const Foo& other)
    {
        // Don't check for self-assignment.
        ...
        return *this;
    }
    ...
};

... this can help because self-assignment tends to be quite rare. We're slowing down the rare case by redundantly self-assigning, but we're speeding up the common case by avoiding the need to check in all other cases. Of course that's unlikely to reduce branch mispredictions significantly since there is a common/rare case skew in terms of the branching, but hey, a branch that doesn't exist can't be mispredicted.

A Naive Attempt at a Small Vector

As a personal story, I formerly worked in a large-scale C codebase which often had a lot of code like this:

char str[256];
// do stuff with 'str'

... and naturally since we had a pretty extensive user base, some rare user out there would eventually type in a name for a material in our software that was over 255 characters in length and overflow the buffer, leading to segfaults. Our team was getting into C++ and started porting a lot of these source files to C++ and replacing such code with this:

std::string str = ...;
// do stuff with 'str'

... which eliminated those buffer overruns without much effort. However, at least back then, containers like std::string and std::vector were heap(free store)-allocated structures, and we found ourselves trading correctness/safety for efficiency. Some of these replaced areas were performance-critical (called in tight loops), and while we eliminated a lot of bug reports with these mass replacements, the users started noticing the slowdowns.

So then we wanted something which was like a hybrid between these two techniques. We wanted to be able to slap something in there to achieve safety over the C-style fixed-buffer variants (which were perfectly fine and very efficient for common-case scenarios), but still work for the rare-case scenarios where the buffer wasn't big enough for user inputs. I was one of the performance geeks on the team and one of the few using a profiler (I unfortunately worked with a lot of people who thought they were too smart to use one), so I got called into the task.

My first naive attempt was something like this (vastly simplified: the actual one used placement new and so forth and was a fully standard-compliant sequence). It involves using a fixed-size buffer (size specified at compile-time) for the common case and a dynamically-allocated one if the size exceeded that capacity.

template <class T, int N>
class SmallVector
{
public:
    ...
    T& operator[](int n)
    {
        return num < N ? buf[n]: ptr[n];
    }
    ...
private:
    T buf[N];
    T* ptr;
};

This attempt was an utter fail. While it didn't pay the price of the heap/free store to construct, the branching in operator[] made it even worse than std::string and std::vector<char> and was showing up as a profiling hotspot instead of malloc (our vendor implementation of std::allocator and operator new used malloc under the hood). So then I quickly got the idea to simply assign ptr to buf in the constructor. Now ptr points to buf even in the common case scenario, and now operator[] can be implemented like this:

T& operator[](int n)
{
    return ptr[n];
}

... and with that simple branch elimination, our hotspots went away. We now had a general-purpose, standard-compliant container we could use that was just about as fast as the former C-style, fixed-buffer solution (only difference being one additional pointer and a few more instructions in the constructor), but could handle those rare-case scenarios where the size needed to be larger than N. Now we use this even more than std::vector (but only because our use cases favor a bunch of teeny, temporary, contiguous, random-access containers). And making it fast came down to just eliminating a branch in operator[].

Common Case/Rare Case Skewing

One of the things learned after profiling and optimizing for years is that there's no such thing as "absolutely-fast-everywhere" code. A lot of the act of optimization is trading an inefficiency there for greater efficiency here. Users might perceive your code as absolutely-fast-everywhere, but that comes from smart tradeoffs where the optimizations are aligning with the common case (common case being both aligned with realistic user-end scenarios and coming from hotspots pointed out from a profiler measuring those common scenarios).

Good things tend to happen when you skew the performance towards the common case and away from the rare case. For the common case to get faster, often the rare case must get slower, yet that's a good thing.

Zero-Cost Exception-Handling

An example of common case/rare case skewing is the exception-handling technique used in a lot of modern compilers. They apply zero-cost EH, which isn't really "zero-cost" all across the board. In the case that an exception is thrown, they're now slower than ever before. Yet in the case where an exception isn't thrown, they're now faster than ever before and often faster in successful scenarios than code like this:

if (!try_something())
    return error;
if (!try_something_else())
    return error;
...

When we use zero-cost EH here instead and avoid checking for and propagating errors manually, things tend to go even faster in the non-exceptional cases than this style of code above. Crudely speaking, it's due to the reduced branching. Yet in exchange, something far more expensive has to happen when an exception is thrown. Nevertheless, that skew between common case and rare case tends to aid real-world scenarios. We don't care quite as much about the speed of failing to load a file (rare case) as loading it successfully (common case), and that's why a lot of modern C++ compilers implement "zero-cost" EH. It is again in the interest of skewing the common case and rare case, pushing them further away from each in terms of performance.

Virtual Dispatch and Homogeneity

A lot of branching in object-oriented code where the dependencies flow towards abstractions (stable abstractions principle, e.g.), can have a large bulk of its branching (besides loops of course, which play well to the branch predictor) in the form of dynamic dispatch (virtual function calls or function pointer calls).

In these cases, a common temptation is to aggregate all kinds of sub-types into a polymorphic container storing a base pointer, looping through it and calling virtual methods on each element in that container. This can lead to a lot of branch mispredictions, especially if this container is being updated all the time. The pseudocode might look like this:

for each entity in world:
    entity.do_something() // virtual call

A strategy to avoid this scenario is to start sorting this polymorphic container based on its sub-types. This is a fairly old-style optimization popular in the gaming industry. I don't know how helpful it is today, but it is a high-level kind of optimization.

Another way I've found to be definitely still be useful even in recent cases which achieves a similar effect is to break the polymorphic container apart into multiple containers for each sub-type, leading to code like this:

for each human in world.humans():
    human.do_something()
for each orc in world.orcs():
    orc.do_something()
for each creature in world.creatures():
    creature.do_something()

... naturally this hinders the maintainability of the code and reduces the extensibility. However, you don't have to do this for every single sub-type in this world. We only need to do it for the most common. For example, this imaginary video game might consist, by far, of humans and orcs. It might also have fairies, goblins, trolls, elves, gnomes, etc., but they might not be nearly as common as humans and orcs. So we only need to split the humans and orcs away from the rest. If you can afford it, you can also still have a polymorphic container that stores all of these subtypes which we can use for less performance-critical loops. This is somewhat akin to hot/cold splitting for optimizing locality of reference.

Data-Oriented Optimization

Optimizing for branch prediction and optimizing memory layouts tends to kind of blur together. I've only rarely attempted optimizations specifically for the branch predictor, and that was only after I exhausted everything else. Yet I've found that focusing a lot on memory and locality of reference did make my measurements result in fewer branch mispredictions (often without knowing exactly why).

Here it can help to study data-oriented design. I've found some of the most useful knowledge relating to optimization comes from studying memory optimization in the context of data-oriented design. Data-oriented design tends to emphasize fewer abstractions (if any), and bulkier, high-level interfaces that process big chunks of data. By nature such designs tend to reduce the amount of disparate branching and jumping around in code with more loopy code processing big chunks of homogeneous data.

It often helps, even if your goal is to reduce branch misprediction, to focus more on consuming data more quickly. I've found some great gains before from branchless SIMD, for example, but the mindset was still in the vein of consuming data more quickly (which it did, and thanks to some help from here on SO like Harold).

TL;DR

So anyway, these are some strategies to potentially reduce branch mispredictions throughout your code from a high-level standpoint. They're devoid of the highest level of expertise in computer architecture, but I'm hoping this is an appropriate kind of helpful response given the level of the question being asked. A lot of this advice is kind of blurred with optimization in general, but I've found that optimizing for branch prediction often needs to be blurred with optimizing beyond it (memory, parallelization, vectorization, algorithmic). In any case, the safest bet is to make sure you have a profiler in your hand before you venture deep.

It is frustrating to hear that people still implement their own small vectors instead of using the one from Boost. But perhaps this was a while ago before it was implemented in Boost. — olq_plo, Jan 28 '22 at 19:36

ouah · Answer 3 · 2015-09-15T09:13:08.130

18

Linux kernel defines likely and unlikely macros based on __builtin_expect gcc builtins:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

(See here for the macros definitions in include/linux/compiler.h)

You can use them like:

if (likely(a > 42)) {
    /* ... */
}

or

if (unlikely(ret_value < 0)) {
    /* ... */
}

edited Sep 15 '15 at 09:13

answered Sep 15 '15 at 09:07

ouah

142,963
15
272
331

Did not know a kernel defines macros :) – BitTickler Sep 20 '20 at 21:58

score 7 · Answer 4 · answered Sep 15 '15 at 09:36

In general it's a good idea to keep hot inner loops well proportioned to the cache sizes most commonly encountered. That is, if your program handles data in lumps of, say, less than 32kbytes at a time and does a decent amount of work on it then you're making good use of the L1 cache.

In contrast if your hot inner loop chews through 100MByte of data and performs only one operation on each data item, then the CPU will spend most of the time fetching data from DRAM.

This is important because part of the reason CPUs have branch prediction in the first place is to be able to pre-fetch operands for the next instruction. The performance consequences of a branch mis-prediction can be reduced by arranging your code so that there's a good chance that the next data comes from L1 cache no matter what branch is taken. Whilst not a perfect strategy, L1 cache sizes seem to be universally stuck on 32 or 64K; it's almost a constant thing across the industry. Admittedly coding in this way is not often straightforward, and relying on profile driven optimisation, etc. as recommended by others is probably the most straightforward way ahead.

Regardless of anything else, whether or not a problem with branch mis-prediction will occur varies according to the CPU's cache sizes, what else is running on the machine, what the main memory bandwidth / latency is, etc.

score 7 · Answer 5 · answered Sep 15 '15 at 10:42

7

Perhaps the most common techniques is to use separate methods for normal and error returns. C has no choice, but C++ has exceptions. Compilers are aware that the exception branches are exceptional and therefore unexpected.

This means that exception branches are indeed slow, as they're unpredicted, but the non-error branch is made faster. On average, this is a net win.

answered Sep 15 '15 at 10:42

MSalters

173,980
10
155
350

If the error has any non-neglegible chance of occurring, this advise is dead wrong: The performance cost of an occurring exception is huge. Never introduce exceptions into your program flow if you care about performance. – cmaster - reinstate monica Nov 06 '17 at 09:16
@cmaster: Even if the chance of an exception is non-negligible and you care about the performance _in the non-exceptional case_, you often don't care about the performance in the exceptional case. Example: compiling code. Compile errors can certainly happen, and build times for large projects certainly are a concern. But the overhead of an exception is utterly dwarfed by the time spent by the human looking at the error. – MSalters Nov 06 '17 at 10:16
My reasoning is simple: Time lost due to exceptions is `exceptionFrequency*handlingTime`. The `handlingTime` is huge, so, `exceptionFrequency` must vanish to allow the product to be small. Thus, if your exception is thrown only once a second, go ahead and use it (if you don't mind exceptions in your code, that is). If chances are that your exception is thrown more than a thousand times a second, it'll quickly become a major performance drain. Error conditions, however, tend to manifest in pretty much every single function, and be triggered regularly. Nothing to use exceptions for. – cmaster - reinstate monica Nov 06 '17 at 10:31
@cmaster: The point here is (since it's about branch-aware programming) that exceptions save time in the order of `(1-exceptionChance)*overheadOfErrorHandlingInNormalCase`. If you're calling a function a thousand times per second, and you have an error return value, it must be checked a thousand times per second. If that error is an exception, the compiler can optimize the no-error scenario. If the error is encoded as a negative integer, the compiler doesn't have that guidance. – MSalters Nov 06 '17 at 10:45
And in the time that you can throw/catch a single exception, you can easily check a thousand error conditions. – cmaster - reinstate monica Nov 06 '17 at 11:37
@cmaster: Apples and oranges. In the end, it's about flow control, and at the point where you've caught the exception you're already past the branch point in the flow (and you're on the slow branch). Merely checking the error code is not a flow control operation yet, and it's the branching which is expensive. (especially as this type of branches pollute the branch predictor cache - exceptions are typically not dependent on predicted branches as the compiler can use an assumed-no-taken branch. E.g. x86 opcode `0x2E`). – MSalters Nov 06 '17 at 12:52
Nope, no apples and oranges: If I know I can execute so many branches within the execution time of a single throw/catch, I can weigh the costs. I can know whether it's better to just check the error conditions, or to use an exception instead. My point is, that it's just so much more expensive to execute a throw/catch rather than an if/else, that you need to remove a lot of executed if/else statements from your hot execution path to make up for a single throw/catch. On average, the CPU must see less than a single throw for a several hundred if's it does not see to make exceptions faster. – cmaster - reinstate monica Nov 06 '17 at 13:52
Just ran a simple benchmark: On my system, throwing an exception up a single function level is 483 times more expensive than returning an error code from the function instead... – cmaster - reinstate monica Nov 06 '17 at 13:53

autistic · Answer 6 · 2015-09-15T13:16:00.363

1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Avoid? Perhaps not. Reduce? Certainly...

2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

It is worth noting that optimisation for one machine isn't necessarily optimisation for another. With that in mind, profile-guided optimisation is reasonably good at rearranging branches, based on whichever test input you give it. This means you don't need to do any programming to perform this optimisation, and it should be relatively tailored to whichever machine you're profiling on. Obviously, the best results will be achieved when your test input and the machine you profile on roughly matches what common expectations... but those are also considerations for any other optimisations, branch-prediction related or otherwise.

Görkem Mülayim · Answer 7 · 2017-11-06T09:03:44.303

To answer your questions let me explain how does branch prediction works.

First of all, there is a branch penalty when the processor correctly predicts the taken branches. If the processor predicts a branch as taken, then it has to know the target of the predicted branch since execution flow will continue from that address. Assuming that the branch target address is already stored in Branch Target Buffer(BTB), it has to fetch new instructions from the address found in BTB. So you are still wasting a few clock cycles even if the branch is correctly predicted.
Since BTB has an associative cache structure the target address might not be present, and hence more clock cycles might be wasted.

On the other, hand if the CPU predicts a branch as not taken and if it's correct then there is no penalty since the CPU already knows where the consecutive instructions are.

As I explained above, predicted not taken branches have higher throughput than predicted taken branches.

Is it possible to avoid branch misprediction using some high level programming technique (i.e. no assembly)?

Yes, it is possible. You can avoid by organizing your code in way that all branches have repetitive branch pattern such that always taken or not taken.
But if you want to get higher throughput you should organize branches in a way that they are most likely to be not taken as I explained above.

What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

If it's possible eliminate branches as possible. If this is not the case when writing if-else or switch statements, check the most common cases first to make sure the branches most likely to be not taken. Try to use __builtin_expect(condition, 1) function to force compiler to produce condition to be treated as not taken.

score 1 · Answer 8 · edited Jun 19 '18 at 15:35

Branchless isn't always better, even if both sides of the branch are trivial. When branch prediction works, it's faster than a loop-carried data dependency.

See gcc optimization flag -O3 makes code slower than -O2 for a case where gcc -O3 transforms an if() to branchless code in a case where it's very predictable, making it slower.

Sometimes you are confident that a condition is unpredictable (e.g. in a sort algorithm or binary search). Or you care more about the worst-case not being 10x slower than about the fast-case being 1.5x faster.

Some idioms are more likely to compile to a branchless form (like a cmov x86 conditional move instruction).

x = x>limit ? limit : x;   // likely to compile branchless

if (x>limit) x=limit;      // less likely to compile branchless, but still can

The first way always writes to x, while the second way doesn't modify x in one of the branches. This seems to be the reason that some compilers tend to emit a branch instead of a cmov for the if version. This applies even when x is a local int variable that's already live in a register, so "writing" it doesn't involve a store to memory, just changing the value in a register.

Compilers can still do whatever they want, but I've found this difference in idiom can make a difference. Depending on what you're testing, it's occasionally better to help the compiler mask and AND rather than doing a plain old cmov. I did it in that answer because I knew that the compiler would have what it needed to generate the mask with a single instruction (and from seeing how clang did it).

TODO: examples on http://gcc.godbolt.org/

In the code example the first ':' in the first line should be a '?'. — hexcoder, Jun 18 '18 at 11:52

Branch-aware programming

8 Answers8

Linked