2

I ran into some slowdowns in a tight loop today caused by an if statement, which surprised me because I expected branch prediction to pipeline that statement successfully and minimize the cost of the conditional.

When I sat down to think about why it wasn't handled better, I realized I didn't know much about how the branch prediction was actually being done. I know the concept of branch prediction quite well and its benefits, but the problem is that I didn't know who was implementing it and what approach they were using to predict the outcome of a conditional.

Looking deeper, I know branch prediction can be done at a few levels:

  1. The hardware itself, as part of instruction pipelining
  2. The compiler of a compiled language such as C++
  3. The interpreter of an interpreted language
  4. A half-compiled language like Java, which may involve both 2 and 3 above

However, because the optimization can happen in so many places, I'm left uncertain how to anticipate branch prediction. If I'm writing in Java, for example, is my conditional optimized when compiled, when interpreted, or by the hardware after interpretation? More interestingly, does the answer change if someone uses a different runtime environment? Could a different branch prediction algorithm in a different interpreter cause a tight loop built around a conditional to show significantly different performance depending on which interpreter it runs under?

Thus my question: how does one generalize an optimization around branch prediction when the software could run on very different computers, which may mean different branch prediction? If the hardware and the interpreter can both change their approach, then profiling and using whichever approach proved fastest is no guarantee. Let's ignore C++, where you have compiler-level ability to force this, and consider interpreted languages in the case where someone still needs to optimize a tight loop within them.

Are there certain assumptions that are generally safe to make regardless of the interpreter used? Does one have to dive into the intricate specification of a language to make any meaningful assumption about branch prediction?

dsollen
  • This is a bit broad. Very generally, analyze and get an average of the hardware it'll run on and try to optimize from that. – edmz Jan 11 '16 at 20:35
  • I wouldn't target an interpreted language, as these have "hidden overhead" that may involve branches over which you have no control. –  Jan 11 '16 at 20:47
  • About the only "portable" measure you can take is to avoid conditional branches when you can. See http://stackoverflow.com/a/17828251/1196549 –  Jan 11 '16 at 20:49

2 Answers

2

Short answer:

To help the branch predictor, try to structure your program so that conditional statements don't depend on apparently random data.

Details

One of the other answers to this question claims:

There is no way to do anything in a high-level language to optimize for branch prediction. Caching, sure, sometimes you can, but branch prediction, no, not at all.

However, this is simply not true. A good illustration of this fact comes from one of the most famous questions on Stack Overflow.

All branch predictors work by identifying patterns of repeated code execution and using this information to predict the outcome and/or target of branches as necessary.

When writing code in a high-level language it's typically not necessary for an application programmer to worry about optimizing conditional branches. For instance, gcc has the __builtin_expect built-in, which allows the programmer to specify the expected outcome of a conditional branch. But even if an application programmer is certain they know the typical outcome of a specific branch, it's usually not necessary to use the annotation. In a hot loop this directive is unlikely to improve performance: if the branch really is strongly biased, the predictor will correctly predict the outcome most of the time even without the programmer's annotation.
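For reference, this is roughly what such an annotation looks like with GCC or Clang; the error-handling scenario below is hypothetical and only meant to illustrate the built-in:

    #include <cstdio>

    int process(int value) {
        // Tell the compiler the error path is rare. This mainly influences code
        // layout (keeping the hot path as the fall-through), rather than
        // directly programming the hardware branch predictor.
        if (__builtin_expect(value < 0, 0)) {
            std::fprintf(stderr, "unexpected negative value: %d\n", value);
            return -1;
        }
        return value * 2;  // hot path
    }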

On most modern processors branch predictors perform incredibly well (better than 95% accurate even on complex workloads). So as a micro-optimization, trying to improve branch prediction accuracy is probably not something that an application programmer would want to focus on. Typically the compiler is going to do a better job of generating optimal code that works for the specific hardware platform it is targeting.

But branch predictors rely on identifying patterns, and if an application is written in such a way that patterns don't exist, then the branch predictor will perform poorly. If the application can be modified so that there is a pattern then the branch predictor has a chance to do better. And that is something you might be able to consider at the level of a high-level language, if you find a situation where a branch really is being poorly predicted.
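To make that concrete, here is a simplified sketch in the spirit of that famous Stack Overflow question (the array size and the threshold of 128 are arbitrary): the same loop runs much faster once the data is sorted, because the branch outcome then follows a pattern the predictor can learn.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    long long sum_large(const std::vector<int>& data) {
        long long sum = 0;
        for (int v : data) {
            if (v >= 128)          // predictable only if the data has a pattern
                sum += v;
        }
        return sum;
    }

    int main() {
        std::vector<int> data(1 << 20);
        for (int& v : data) v = std::rand() % 256;

        long long unsorted_sum = sum_large(data);   // branch is roughly random here

        // Sorting creates a pattern: a long run of "false" outcomes followed by
        // a long run of "true" outcomes, which the predictor handles almost perfectly.
        std::sort(data.begin(), data.end());
        long long sorted_sum = sum_large(data);

        return (unsorted_sum == sorted_sum) ? 0 : 1;  // same result, very different branch behavior
    }

Note that nothing hardware-specific appears in the source: the improvement comes purely from arranging the data so the branch outcome is predictable, which is exactly the kind of change an application programmer can make regardless of the platform.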

Gabriel Southern
0

Branch prediction, like caching and pipelining, is done to make code run faster in general by overcoming bottlenecks in the system (super-slow cheap DRAM, which all DRAM is; all the layers of buses between X and Y; etc.).

There is no way to do anything in a high-level language to optimize for branch prediction. Caching, sure, sometimes you can, but branch prediction, no, not at all. In order to predict, the core has to have the branch in the pipe along with the instructions that precede it, and across architectures and implementations it is not possible to find one rule that works. Often it is not possible even within one architecture and implementation, working from the high-level language.

You could also easily end up in a situation where, by tuning for branch prediction, you de-tune for the cache or the pipeline or other optimizations you might want to use instead. Overall performance is first and foremost application specific, and after that it is something tuned to that application, not something generic.

For as much as I like to preach and do optimizations at the high-level language level, branch prediction is one that falls into the premature optimization category. Just enable it in the core if it is not already enabled; sometimes it saves you a couple of cycles, most of the time it doesn't, and depending on the implementation it can cost more cycles than it saves. Like a cache, it comes down to hits versus misses: if it guesses right you have code in faster memory sooner on its way to the pipe, and if it guesses wrong you have burned bus cycles that could have been used by code that was actually going to run.

Caching is usually a benefit (although it is not hard to write high-level code that shows it costing performance instead of saving it), since code usually runs linearly for some number of instructions before branching. Likewise, data is accessed in order often enough to overcome the penalties. Branching is not something we do on every instruction, and where we branch to does not have a common answer.

Your backend could try to tune for branch prediction by having the pre-branch decisions happen a few cycles before the branch, but all within a pipe's depth and tuned for fetch-line or cache-line alignment. Again, this interferes with tuning for other features in the core.

old_timer
  • "There is no way to do anything in a high-level language to optimize for branch prediction" I disagree. One thing you can do in a high-level language is to _eliminate_ branches by expressing the problem in terms of lookups or arithmetic. This helps branch prediction work better on the remaining branches, because there's more "history" available. I've made huge performance improvements to bottleneck code with this approach (see the sketch after this thread). – Adrian McCarthy Jan 11 '16 at 23:53
  • It would have to be straight-line code, no function calls, very limited in what math you can do, etc. Sure, you can sustain that for a little while, but not for extended periods. It likely could have been optimized into fewer lines anyway, making the next branch come sooner. – old_timer Jan 12 '16 at 00:39
  • "straight code, no function calls" which is where branch prediction is likely to make the most difference anyway. Great for tight loops, but less interesting for general code. – Adrian McCarthy Jan 12 '16 at 00:44