How can you optimize jump instruction slowed by cpu branch miss?

Question

Background

It is known that random branching costs significant overhead. And there was a post in SO answering such question.

Similar performance impact can be seen with jump instructions in many CPU architectures. And there was a post in SO for such theme as well

So if you used programming patterns like function pointer or just normal inheritable C++ class based function call, we have to pay the cost of branch miss.

Even for the most advanced hardware branch prediction algorithms can only do globally shared address history based branch prediction, and perhaps it may speculatively fetch the branch target address code and so on.

But by definition, it won't work for the first execution.

Many embedded appliance, smartphones, etc should demand maximum performance at

boot time
first execution of an application like browser

Which calls millions of function calls and may not want to change there software architecture significantly like converting all indirect jumps into direct jumps...

If the conditions are as follows,

Conditions:

must run at top speed
- at the first run
  or
- the jump looks completely random to the cpu

Is the following best to achieve the result?

And I want to know any example that does dynamic/static code rewrite of indirect to direct jumps.

How to get maximum performance:

always use either branch likely or unlikely
use prelink
use mlock or readahead, cacheflush(ICACHE) the function call
rewrite indirect jumps to direct jumps dynamically or statically
(found a paper written in 1996 by Bradley M Kuhn for static rewrite for some case)

The paper I found was translating virtual function calls to static function call at source code level, but binary link time optimization seemed better for the software developer point of view.

Hmm, no, there's no possible branch mis-prediction penalty on an unconditional jump. — Hans Passant, Jun 28 '12 at 18:45
It's not very clear what you are really asking. But perhaps you are looking for manually inserting branch predictor hints? These are non-portable, but you could use macros such as `likely()` or `unlikely()` on some platforms. — TJD, Jun 28 '12 at 18:47
@HansPassant, you forgot "direct". You may have mis-prediction on unconditional indirect jumps. — AProgrammer, Jun 28 '12 at 18:47

score 2 · Answer 1 · answered Jun 28 '12 at 18:42

2

Are you saying you can predict with near-certainty where the branch will go, but the CPU cannot? A heavy cost is incurred when branch prediction fails; if your branch prediction table is hot and it's hitting the same path every time, I posit that branch prediction would likely succeed. (And if it's not consistent, then obviously prediction cannot succeed 100%.)

answered Jun 28 '12 at 18:42

StilesCrisis

15,972
4
39
62

Yes, I'm talking about the case the CPU cannot predict and the programmer can predict with additional information at compile time or at runtime. (replacing the indrect call to direct call in the text section only seems to be a solution... but I wanted to know if there was any instruction like branch likely or the like to tell the cpu branch prediction table is hot, or not) – holmes Jun 29 '12 at 00:41
If the call goes the same way every single time, and the call happens frequently enough, the branch predictor will figure it out without any intervention on your part. – StilesCrisis Jun 29 '12 at 04:41
Yes, but what if you wanted to change your mind and put different function pointer randomly? CPU cannot predict this branch pattern. – holmes Jun 29 '12 at 05:38
If you put in a different function pointer randomly, at some point you must incur a branch miss. Whether at the branch, or at some time before, there will be a test or prediction that goes the wrong way. You can shift where the miss happens, but you can't make it vanish. – StilesCrisis Jun 29 '12 at 18:08
I don't think so. At least, you can do things like rewrite indirect jump with direct jump and flush cache, at several hundred instructions before you hit the code, and cpu can successfully prefetch that part of code and run through in top speed. So I want to know if this is the only answer... – holmes Jun 29 '12 at 23:53
Well, if the address can be loaded several hundred cycles ahead of when it is to be used, it's entirely possible that you won't stall at all. Why are you asking specifically? – StilesCrisis Jun 30 '12 at 04:01
I thought even in that case, as long as you use indirect jumps, you would stall because branch prediction fails and even if you explicitly flush the cache, you cannot pin it, and CPU I/F decoder might not do it's job well. – holmes Jun 30 '12 at 04:39

score 1 · Answer 2 · answered Jun 28 '12 at 18:44

1

CPUs are already doing this for you, see Branch Prediction.

answered Jun 28 '12 at 18:44

Nikolai Fetissov

82,306
11
110
171

score 1 · Accepted Answer · answered Jun 28 '12 at 20:23

1

For function pointers and virtual functions you will need to predict the target of the branch, because you can't know at compile time which code needs to be executed next.

http://en.wikipedia.org/wiki/Branch_target_predictor

answered Jun 28 '12 at 20:23

Danny

405
2
9

How can you optimize jump instruction slowed by cpu branch miss?

Background

Conditions:

How to get maximum performance:

3 Answers3