34

In my university course, I was taught that by convention it is better to place the more probable condition in the if branch rather than in the else branch, which may help the static branch predictor. For instance:

if (check_collision(player, enemy)) { // very unlikely to be true
    doA();
} else {
    doB();
}

may be rewritten as:

if (!check_collision(player, enemy)) {
    doB();
} else {
    doA();
}

I found a blog post Branch Patterns, Using GCC, which explains this phenomenon in more detail:

Forward branches are generated for if statements. The rationale for making them not likely to be taken is that the processor can take advantage of the fact that instructions following the branch instruction may already be placed in the instruction buffer inside the Instruction Unit.

Next to it, it says (emphasis mine):

When writing an if-else statement, always make the "then" block more likely to be executed than the else block, so the processor can take advantage of instructions already placed in the instruction fetch buffer.

Finally, there is an article by Intel, Branch and Loop Reorganization to Prevent Mispredicts, which summarizes this with two rules:

Static branch prediction is used when there is no data collected by the microprocessor when it encounters a branch, which is typically the first time a branch is encountered. The rules are simple:

  • A forward branch defaults to not taken
  • A backward branch defaults to taken

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common.

As I understand it, the idea is that a pipelined CPU may keep following instructions from the instruction cache without interrupting the stream by jumping to another address within the code segment. I am aware, though, that this may be largely oversimplified in the case of modern CPU microarchitectures.

However, it looks like GCC doesn't respect these rules. Given the code:

extern void foo();
extern void bar();

int some_func(int n)
{
    if (n) {
        foo();
    }
    else {
        bar();
    }
    return 0;
}

it generates (version 6.3.0 with -O3 -mtune=intel):

some_func:
        lea     rsp, [rsp-8]
        xor     eax, eax
        test    edi, edi
        jne     .L6            ; here, forward branch if (n) is (conditionally) taken
        call    bar
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret
.L6:
        call    foo
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret

The only way I found to force the desired behavior is to rewrite the if condition using __builtin_expect as follows:

if (__builtin_expect(n, 1)) { // force n condition to be treated as true

so the assembly code would become:

some_func:
        lea     rsp, [rsp-8]
        xor     eax, eax
        test    edi, edi
        je      .L2             ; here, backward branch is (conditionally) taken
        call    foo
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret
.L2:
        call    bar
        xor     eax, eax
        lea     rsp, [rsp+8]
        ret
Grzegorz Szpetkowski
  • http://stackoverflow.com/q/109710/905902 The Linux kernel uses macros (calling __builtin_expect) to apply a priori knowledge about the conditional branches. – wildplasser Jan 26 '17 at 19:12
  • Modern Intel CPUs don't use static branch prediction. I also don't think GCC promises anywhere to consider the "true" clause of an if/else statement to be the most likely alternative. You're supposed to use `__builtin_expect`, like wildplasser mentioned, to tell it which is more likely. Or better yet, profile-guided optimization. – Ross Ridge Jan 26 '17 at 19:51
  • @RossRidge: Could you point to some reference? The Intel's article, that I linked also says that `Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.`, so I guess there might be some overhead at the beginning, due to missed static branch prediction. – Grzegorz Szpetkowski Jan 26 '17 at 19:57
  • See Agner Fog's microarchitecture manual. Section 3.16 "Static Prediction in PM and Core 2": "These processors do not use static prediction. The predictor simply makes a random prediction the first time a branch is seen, depending on what happens to be in the BTB entry that is assigned to the new branch." http://www.agner.org/optimize/ – Ross Ridge Jan 26 '17 at 20:01
  • If your program is short enough that you can notice the difference from a single bad branch prediction, then thinking about optimization is a waste of time and brain cells. – rici Jan 26 '17 at 20:02
  • @rici: You are right. The example is oversimplified, so it's easier to analyze. For a larger program that contains a lot of branches (especially if it exceeds the limits of the CPU's branch prediction cache), the difference may become significant. – Grzegorz Szpetkowski Jan 26 '17 at 20:11
  • Even in a full-scale program it's unlikely to matter. Unless you're using a processor with only static prediction, most jumps are going to be dynamically predicted. – Ross Ridge Jan 26 '17 at 20:24
  • @grzegorz: i was reacting to "I guess there might be *some overhead at the beginning*, due to missed static branch prediction" (emphasis added) and i think the same principle applies. One thing that sometimes invalidates branch prediction caches is misguidedly unrolling loops thereby increasing the number of branch points. I know that's not directly relevant here but I offer it as a cautionary tale. – rici Jan 26 '17 at 20:59
  • As with any optimization as low-level as this, you have to have a fixed target platform and profile the real production code. Any simplification (profiling of a simplified sample) or theoretical reasoning is useless at this level, as the final performance is the result of many interconnected tiny details and rules; fixing one of them may break 5 others, in the end ruining the performance. While it certainly is possible for a human brain to reason about this, and even to figure out the correct optimal way, it would take an incredible budget of reasoning. Not worth it, when you can simply write code and profile. – Ped7g Jan 27 '17 at 01:25
  • And compilers of course do generate sub-optimal machine code. Nobody wants to compile with NP-hard algorithms just to gain 1-10% performance, as generating "optimal" code is an NP-hard type of problem. – Ped7g Jan 27 '17 at 01:27
  • For some reason, gcc's profile_estimate pass guesses that n has 54% chances of being 0... (see `-fdump-tree-all-all`) Normally it has a heuristic that == is more likely false, but it doesn't seem used here. You could file it on gcc's bugzilla to ask about it. Note that if you compile with `-fprofile-generate`, then run your program, then recompile with `-fprofile-use`, gcc will have access to real statistics and make better decisions. – Marc Glisse Jan 27 '17 at 13:41
  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79489 contains a detailed analysis, I highly recommend reading Martin's comment. – Marc Glisse Feb 14 '17 at 09:01
  • How are the rules obeyed/disobeyed when a *chain* of branches is executed? – autistic May 18 '17 at 07:23

2 Answers

4

The short answer: no, it does not.

GCC performs a metric ton of non-trivial optimizations, and one of them is guessing branch probabilities from the control flow graph.

According to GCC manual:

-fno-guess-branch-probability

Do not guess branch probabilities using heuristics.

GCC uses heuristics to guess branch probabilities if they are not provided by profiling feedback (-fprofile-arcs). These heuristics are based on the control flow graph. If some branch probabilities are specified by __builtin_expect, then the heuristics are used to guess branch probabilities for the rest of the control flow graph, taking the __builtin_expect info into account. The interactions between the heuristics and __builtin_expect can be complex, and in some cases, it may be useful to disable the heuristics so that the effects of __builtin_expect are easier to understand.

-freorder-blocks may swap branches as well.

Also, as the OP mentioned, the behavior may be overridden with __builtin_expect.

Proof

Look at the following listing.

#include <stdio.h>

void doA() { printf("A\n"); }
void doB() { printf("B\n"); }

int check_collision(void* a, void* b)
{ return a == b; }

void some_func (void* player, void* enemy) {
    if (check_collision(player, enemy)) {
        doA();
    } else {
        doB();
    }
}

int main() {
    // warming up gcc statistic
    some_func((void*)0x1, NULL);
    some_func((void*)0x2, NULL);
    some_func((void*)0x3, NULL);
    some_func((void*)0x4, NULL);
    some_func((void*)0x5, NULL);
    some_func(NULL, NULL);
    return 0;
}

It is obvious that check_collision will return 0 most of the time. So, the doB() branch is likely, and GCC can guess this:

gcc -O main.c -o opt.a
objdump -d opt.a

The asm of some_func is:

sub    $0x8,%rsp
cmp    %rsi,%rdi
je     6c6 <some_func+0x18>
mov    $0x0,%eax
callq  68f <doB>
add    $0x8,%rsp
retq   
mov    $0x0,%eax
callq  67a <doA>
jmp    6c1 <some_func+0x13>

But of course, we can stop GCC from being too smart:

gcc -fno-guess-branch-probability main.c -o non-opt.a
objdump -d non-opt.a

And we will get:

push   %rbp
mov    %rsp,%rbp
sub    $0x10,%rsp
mov    %rdi,-0x8(%rbp)
mov    %rsi,-0x10(%rbp)
mov    -0x10(%rbp),%rdx
mov    -0x8(%rbp),%rax
mov    %rdx,%rsi
mov    %rax,%rdi
callq  6a0 <check_collision>
test   %eax,%eax
je     6ef <some_func+0x33>
mov    $0x0,%eax
callq  67a <doA>
jmp    6f9 <some_func+0x3d>
mov    $0x0,%eax
callq  68d <doB>
nop
leaveq 
retq  

So GCC will leave branches in source order.

I used gcc 7.1.1 for those tests.

ivaigult
  • To be fair, you should compile both versions with the same `-O` flag, and then include `-fno-guess-branch-probability` in the second. The code without optimization is totally different from the first one with `-O`, and you can't really conclude it's just the `-fno-guess-branch-probability` flag that changed the block order, because there are dozens of other flags and optimizations applied to the former but not the latter. – BeeOnRope Oct 18 '17 at 22:37
-1

I Think That You've Found A "Bug"

The funny thing is that optimization for space and no optimization are the only cases in which the "optimal" instruction code is generated: gcc -S [-O0 | -Os] source.c

some_func:
LFB0:
       pushl   %ebp
       movl    %esp, %ebp
       subl    $8, %esp
       cmpl    $0, 8(%ebp)
       je      L2
       call    _foo
       jmp     L3
L2:
       call    _bar
L3:
       movl    $0, %eax
       # Or, for -Os:
       # xorl    %eax, %eax
       leave
       ret

My point is that ...


some_func:
LFB0:
       pushl   %ebp
       movl    %esp, %ebp
       subl    $8, %esp
       cmpl    $0, 8(%ebp)
       je      L2
       call    _foo

... up to & through the call to foo everything is "optimal", in the traditional sense, regardless of the exit strategy.

Optimality is ultimately determined by the processor, of course.

veganaiZe
  • Really? Another down vote with no explanation? How does that help anybody? This is the exact assembly output (minus directives) and it **is** *optimal for traditional CPUs* -- it jumps on the else, not the true condition. The corresponding options to `gcc` are counter-intuitive. There is no optimal code, in this situation, for modern x86 CPUs. – veganaiZe Sep 13 '17 at 02:34
  • The entire function you show definitely isn't optimal for any CPU. It uses `mov $0, %eax` to zero the return value, instead of [`xor %eax,%eax`](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and). (This must be the `-O0` output, not the `-Os` output.) Also, tail duplication would avoid the `jmp L3` on the `call foo` path. Also, no point setting up a frame pointer. Not downvoting because it's at least interesting to see how gcc lays out branches here. – Peter Cordes Oct 09 '17 at 11:21
  • We're specifically speaking about *branch prediction*. So that entails a `jmp` instruction, at some point. The point is that there is ***no*** `jmp` to call `foo`... the `jmp` is to call `bar`. And the only difference between `-O0` and `-Os`, in this sitch, is how zero is returned to the calling environment. --The code *is* optimal for (older) x86 CPUs using traditional branch prediction (ie. no `jmp` when taking the `then` case). – veganaiZe Oct 16 '17 at 19:36
  • I'm talking about the `jmp`, not the `je`. Of course you need a `je` (unless you branchlessly choose between two function pointers), but the code following the `je` could be `call _foo` / `xor %eax,%eax` / `leave` / `ret` to avoid a `jmp`. Tail duplication increases code size but decreases dynamic instruction count, and decreases the amount of taken jumps. For the `call _foo` case, there are no taken `jmp` or `jcc` instructions. – Peter Cordes Oct 17 '17 at 01:09
  • *Branch prediction*, in this case, is concerned with the ***jump*** *before* the `call`. The `jmp L3` is, to my knowledge, just a necessary side-effect; to get to the exit code. – veganaiZe Oct 18 '17 at 22:05
  • Even unconditional branches need to be predicted before they're decoded to avoid decode bubbles. They can have at least a small negative impact on the front-end. And no, you don't need them. See the asm in the question, which does exactly the kind of tail-duplication I was talking about. Each `call` is followed by a duplicated block of a couple instructions ending with `ret`, instead of using a `jmp` to a shared tail. (And BTW, I haven't downvoted this answer, even though I disagree that it's a bug. It's a somewhat interesting data point.) – Peter Cordes Oct 18 '17 at 22:17
  • Thank you for taking the time to discuss this Peter. I respect your opinion. So, directing the conversation back towards the point which I'm attempting to make (along with the OP), why does `-O3` *jump* to `foo` while `-Os` / `-O0` do not? Would a traditional x86 not attempt to prefetch the *else* (`bar`) instead of the *then* (`foo`), in such case? As is, it seems to contradict the expected behavior. – veganaiZe Oct 20 '17 at 01:58
  • As [@ivaigult's answer explains](https://stackoverflow.com/questions/41880779/does-gcc-generate-suboptimal-code-for-static-branch-prediction/46644725#46644725) gcc guesses that `if(n)` will be false more often than true, so it lays out the branches on purpose to put `call bar` on the path with no taken branches. This is the intended behaviour. If you think you know which way a branch will go, you can use `__builtin_expect` (or [a `likely` / `unlikely` wrapper macro](https://stackoverflow.com/a/40765708/224132)) to tell gcc about it. – Peter Cordes Oct 20 '17 at 02:03
  • I guess `-Os` doesn't enable `-fguess-branch-probability`, or else it makes a different guess than `-O3`. (I just checked on godbolt: `-Os` does include `-fguess-branch-probability`. https://godbolt.org/g/WLiZGz) Anyway, gcc's documentation is pretty clear that it doesn't just assume the first part of an if/else is the fast-path. – Peter Cordes Oct 20 '17 at 02:05