Background
Unpredictable branches are known to cost significant overhead, and there is an SO post answering a question about that.
A similar performance impact can be seen with jump instructions on many CPU architectures, and there is an SO post on that topic as well.
So if you use programming patterns such as function pointers, or ordinary virtual function calls on C++ classes with inheritance, you have to pay the cost of branch misses.
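To make the two patterns concrete, here is a minimal sketch of the kind of call sites I mean. All names (`Shape`, `Square`, `BinaryOp`, `apply`, `area_of`) are hypothetical and only for illustration. Note that an optimizing compiler may inline or devirtualize calls this simple, so the indirect jump only survives when the target really is unknown at compile time.

```cpp
#include <cstdint>

// Hypothetical class hierarchy, for illustration only.
struct Shape {
    virtual ~Shape() = default;
    // Called through the vtable: normally compiled to an indirect call.
    virtual std::int64_t area() const = 0;
};

struct Square : Shape {
    explicit Square(std::int64_t s) : side(s) {}
    std::int64_t area() const override { return side * side; }
    std::int64_t side;
};

// Hypothetical function pointer type.
using BinaryOp = std::int64_t (*)(std::int64_t, std::int64_t);

std::int64_t add(std::int64_t a, std::int64_t b) { return a + b; }
std::int64_t mul(std::int64_t a, std::int64_t b) { return a * b; }

// Indirect call through a function pointer: the target is only known at run time.
std::int64_t apply(BinaryOp op, std::int64_t a, std::int64_t b) {
    return op(a, b);
}

// Indirect call through a virtual function: the target depends on the dynamic type.
std::int64_t area_of(const Shape& s) {
    return s.area();
}
```

Both `apply` and `area_of` contain a call whose target the CPU has to predict, and on the first execution there is no history to predict from.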
Even the most advanced hardware branch predictors can only predict from a globally shared history of branch addresses, and perhaps speculatively fetch the code at the predicted branch target, and so on.
But by definition, that won't work for the first execution.
Many embedded appliances, smartphones, etc. demand maximum performance at
- boot time
- first execution of an application like browser
These make millions of function calls, and their developers may not want to change their software architecture significantly, for example by converting all indirect jumps into direct jumps.
If the conditions are as follows,
Conditions:
- must run at top speed
- at the first run
or
- the jump looks completely random to the CPU
- at the first run
Is the following the best way to achieve the result?
I would also like to know of any examples that dynamically or statically rewrite indirect jumps into direct jumps.
How to get maximum performance:
- always annotate branches as either likely or unlikely
- use prelink
- use mlock or readahead, cacheflush(ICACHE) the function call
- rewrite indirect jumps to direct jumps dynamically or statically
(I found a paper written in 1996 by Bradley M. Kuhn on static rewriting for some cases)
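For the first item, this is roughly what I mean by likely/unlikely, as a sketch assuming GCC or Clang, where `__builtin_expect` is available. The macro names `LIKELY`/`UNLIKELY` and the function `checked_div` are my own placeholders. As far as I understand, these hints only influence code layout and static prediction of conditional branches; they do not tell the CPU the target of an indirect jump.

```cpp
// Hint macros in the style many code bases use; they fall back to a
// plain condition on compilers without __builtin_expect.
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

// Hypothetical example: the zero-divisor path is marked as the cold path,
// so the compiler can lay out the common path as the fall-through.
int checked_div(int num, int den, int fallback) {
    if (UNLIKELY(den == 0)) {
        return fallback;
    }
    return num / den;
}
```

C++20 also has the standard `[[likely]]` and `[[unlikely]]` attributes for the same purpose.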
The paper I found translates virtual function calls into static function calls at the source code level, but binary link-time optimization seems better from the software developer's point of view.
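To show the kind of source-level rewrite I have in mind, here is a hand-written sketch. It is not taken from the paper, and the names `Codec`, `FastCodec`, `decode_guarded` and `decode_direct` are hypothetical. The idea is to test for the expected dynamic type and then make a qualified call, which the compiler resolves statically, keeping the virtual call only as a fallback.

```cpp
#include <typeinfo>

// Hypothetical hierarchy, for illustration only.
struct Codec {
    virtual ~Codec() = default;
    virtual int decode(int x) const { return x; }
};

// 'final' tells the compiler that no class can override decode() further.
struct FastCodec final : Codec {
    int decode(int x) const override { return x * 2; }
};

// Guarded rewrite: if the dynamic type is the expected one, use a
// qualified call (bound at compile time, no vtable dispatch);
// otherwise fall back to the ordinary virtual call.
int decode_guarded(const Codec& c, int x) {
    if (typeid(c) == typeid(FastCodec)) {
        return static_cast<const FastCodec&>(c).FastCodec::decode(x);
    }
    return c.decode(x);
}

// When the static type is already the final class, the compiler is
// allowed to emit a direct call without any guard.
int decode_direct(const FastCodec& c, int x) {
    return c.decode(x);
}
```

This trades the indirect call for a type check plus a conditional branch, so whether it actually helps on a first run is something I would have to measure. Compilers can attempt similar transformations themselves, for example GCC with `-fdevirtualize` and `-flto`, which is closer to the link-time approach I would prefer.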