
A developer can use the __builtin_expect builtin to help the compiler understand in which direction a branch is likely to go.

In the future, we may get a standard attribute for this purpose, but as of today at least all of clang, icc and gcc support the non-standard __builtin_expect instead.

However, icc seems to generate oddly terrible code when you use it1. That is, code that uses the builtin is strictly worse than the code without it, regardless of which direction the prediction is made.

Take for example the following toy function:

int foo(int a, int b)
{
  do {
     a *= 77;
  } while (b-- > 0);  
  return a * 77;
}

Out of the three compilers, icc is the only one that compiles this to the optimal scalar loop of 3 instructions:

foo(int, int):
..B1.2:                         # Preds ..B1.2 ..B1.1
        imul      edi, edi, 77                                  #4.6
        dec       esi                                           #5.12
        jns       ..B1.2        # Prob 82%                      #5.18
        imul      eax, edi, 77                                  #6.14
        ret          

Both gcc and clang manage to miss the easy solution and use 5 instructions.

On the other hand, when you use likely or unlikely macros on the loop condition, icc goes totally braindead:

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

int foo(int a, int b)
{

   do {
     a *= 77;
  } while (likely(b-- > 0));  

   return a * 77;
}

This loop is functionally equivalent to the previous loop (since __builtin_expect just returns its first argument), yet icc produces some awful code:

foo(int, int):
        mov       eax, 1                                        #9.12
..B1.2:                         # Preds ..B1.2 ..B1.1
        xor       edx, edx                                      #9.12
        test      esi, esi                                      #9.12
        cmovg     edx, eax                                      #9.12
        dec       esi                                           #9.12
        imul      edi, edi, 77                                  #8.6
        test      edx, edx                                      #9.12
        jne       ..B1.2        # Prob 95%                      #9.12
        imul      eax, edi, 77                                  #11.15
        ret                                                     #11.15

The function has doubled in size to 10 instructions, and (worse yet!) the critical loop has more than doubled to 7 instructions with a long critical dependency chain involving a cmov and other weird stuff.

The same is true if you use the unlikely hint instead, and across all icc versions (13, 14, 17) that godbolt supports. So the code generation is strictly worse, regardless of the hint and regardless of the actual runtime behavior.

Neither gcc nor clang suffer any degradation when hints are used.

What's up with that?


1 At least in the first and subsequent examples I tried.

BeeOnRope

2 Answers


To me it seems like an ICC bug. This code (available on godbolt)

int c;

do 
{
    a *= 77;
    c = b--;
} 
while (likely(c > 0));  

which simply uses an auxiliary local variable c, produces output without the edx = !!(esi > 0) pattern:

foo(int, int):
  ..B1.2:                         
    mov       eax, esi
    dec       esi
    imul      edi, edi, 77
    test      eax, eax
    jg        ..B1.2

though this is still not optimal (it could do without eax).

I don't know whether the official ICC policy on __builtin_expect is full support or just compatibility support.


This question seems better suited for the Official ICC forum.
I've tried posting this topic there, but I'm not sure I did a good job (I've been spoiled by SO).
If they answer me I'll update this answer.

EDIT
I got an answer at the Intel Forum; they have recorded this issue in their tracking system.
As of today, it seems to be a bug.

Margaret Bloom
  • Thanks, fingers crossed on the product forum post. I've generally had bad experiences with such posts (versus an SO query), but at least some parts of the Intel forum are better than average (i.e., actual product management eyeballs on various posts). – BeeOnRope Jan 20 '17 at 17:23
  • @BeeOnRope They haven't answered so far. I'll keep the tab pinned in my browser until Tuesday, after that I guess we'll never get an answer. I found the posting options of the forum confusing, maybe I've done something wrong. – Margaret Bloom Jan 20 '17 at 18:30
  • It looks fine to me. I've never posted in the `icc` forum, but some of the other ones are decent. – BeeOnRope Jan 20 '17 at 18:51
  • @BeeOnRope Intel recorded this issue in their tracking system. It could be a bug. We'll see :) – Margaret Bloom Jan 23 '17 at 10:52
  • Great! I also found that the problem mostly occurs when the expression passed to `likely` is more complex than a simple comparison - especially if it involves mutation. For example, you can write it [like this](https://godbolt.org/g/GRRAoE), which moves the `b--` into the loop like your example, but does away with `c`, and the optimal loop is restored. – BeeOnRope Jan 24 '17 at 01:30

Don't let the instructions deceive you. What matters is performance.

Consider this rather crude test:

#include "stdafx.h"
#include <windows.h>
#include <iostream>

int foo(int a, int b) {
    do { a *= 7; } while (b-- > 0);
    return a * 7;
}

int fooA(int a, int b) {
    __asm {     
        mov     esi, b
        mov     edi, a
        mov     eax, a
        B1:                        
        imul    edi, edi, 7                           
        dec     esi                                         
        jns     B1      
        imul    eax, edi, 7    
    }
}

int fooB(int a, int b) {
    __asm {
        mov     esi, b
        mov     edi, a
        mov     eax, 1                                    
        B1:                        
        xor     edx, edx                              
        test    esi, esi                                   
        cmovg   edx, eax                                   
        dec     esi                                        
        imul    edi, edi, 7                                
        test    edx, edx                                   
        jne     B1      
        imul    eax, edi, 7
    }
}

int main() {
    DWORD start = GetTickCount();
    int j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += foo(aa, bb);
        }
    }
    std::cout << "foo compiled (/Od)\n" << "j = " << j << "\n" 
        << GetTickCount() - start << "ms\n\n";

    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooA(aa, bb);
        }
    }
    std::cout << "optimal scalar\n" << "j = " << j << "\n" 
        << GetTickCount() - start << "ms\n\n";

    start = GetTickCount();
    j = 0;
    for (int aa = -10; aa < 10; aa++) {
        for (int bb = -500; bb < 15000; bb++) {
            j += fooB(aa, bb);
        }
    }
    std::cout << "use likely \n" << "j = " << j << "\n" 
        << GetTickCount() - start << "ms\n\n";

    std::cin.get();
    return 0;
}

produces output:

foo compiled (/Od)
j = -961623752
4422ms

optimal scalar
j = -961623752
1656ms

use likely
j = -961623752
1641ms

This is naturally entirely CPU dependent (tested here on a Haswell i7), but the two asm loops are generally very nearly identical in performance when tested over a range of inputs. A lot of this has to do with the selection and ordering of instructions being conducive to leveraging instruction pipelining (latency), branch prediction, and other hardware optimizations in the CPU.

The real lesson when you're optimizing is that you need to profile - it's extremely difficult to do this by inspection of the raw assembly.

Even giving a challenging test where likely(b-- > 0) isn't true more than a third of the time:

for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -3; bb < 9; bb++) {
        j += fooX(aa, bb);
    }
}

results in:

foo compiled (/Od) : 1844ms

optimal scalar : 906ms

use likely : 1187ms

Which isn't bad. What you have to keep in mind is that the compiler will generally do its best without your interference. Using __builtin_expect and the like should really be restricted to cases where you have existing code that you have profiled and that you have specifically identified as being both hotspots and as having pipeline or prediction issues. This trivial example is an ideal case where the compiler will almost certainly do the right thing without help from you.

By including __builtin_expect you're asking the compiler to compile in a different way - a more complex way, in terms of pure number of instructions, but a more intelligent one, in that it structures the assembly to help the CPU make better branch predictions. In a case of pure register play (as in this example) there's not much at stake, but if it improves prediction in a more complex loop, maybe saving you a bad misprediction, cache misses, and related collateral damage, then it's probably worth using.

I think it's pretty clear here, at least, that when the branch actually is likely, we very nearly recover the full performance of the optimal loop (which I think is impressive). In cases where the "optimal loop" is more complex and less trivial, we can expect that the codegen would indeed improve branch prediction rates (which is what this is really all about). I think this is really a case of "if you don't need it, don't use it."


On the topic of likely vs unlikely generating the same assembly, this doesn't imply that the compiler is broken - it just means that the same codegen is effective regardless of whether the branch is mostly taken or mostly not taken - as long as it is mostly something, it's good (in this case). The codegen is designed to optimise use of the instruction pipeline and to assist branch prediction, which it does. While we saw some reduction in performance with the mixed case above, pushing the loop to mostly unlikely recovers performance.

for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -30; bb < 1; bb++) {
        j += fooX(aa, bb);
    }
}

foo compiled (/Od) : 2453ms

optimal scalar : 1968ms

use likely : 2094ms

J...
  • To be fair, I'm not fooled. It _just happens_ that the loop carried `mul` dependency in my arbitrary example overwhelms the rest of the loop, but it's bad code generation by any metric (e.g., code size is terrible, and it will still suck in practice for small `b`). Let me fix the example so it actually sucks for your "big loop count" test. – BeeOnRope Jan 19 '17 at 01:55
  • @BeeOnRope Well... you told it that `likely(b-- >0)`. Why does it surprise you that the codegen is not ideal when that isn't true? I selected my `bb` variable range to fit the case where it is indeed likely that `b-- >0`. – J... Jan 19 '17 at 02:03
  • well, the codegen is not ideal in _any_ case: it is **strictly worse** for the `b == 0` case, the small `b` case, the medium `b` case and the large `b` case (in that case the difference is hard to measure though). Furthermore, as I noted, it produces _identical_ code when `likely` is replaced with `unlikely` - so it's not actually using the heuristic at all. – BeeOnRope Jan 19 '17 at 02:12
  • @BeeOnRope In *this case*, yes, it is indeed *strictly worse*, but you *explicitly* told the compiler to add some extra fancy to the codegen. For a ridiculously trivial loop I'm not surprised it doesn't work out well. – J... Jan 19 '17 at 02:40
  • @BeeOnRope As for `likely` and `unlikely` generating the same code, this doesn't imply a lack of heuristic. It's just that in this case it's not relevant whether the branch is likely or not, the codegen helps the instruction pipeline and prediction the same, regardless of whether the jump ends up mostly taken or not taken. As long as it's mostly either *taken* or *not taken* it becomes efficient. – J... Jan 19 '17 at 02:40
  • Sure, that's possible in theory - for example a compiler might use a `cmov` type flow for unpredictable branches, and then use the (same) `jcc` based flow if it knows the branches are skewed - but none of that is happening here - the compiler just generates totally crap code that makes no sense and is _strictly worse_ than the other code. It has no redeeming quality - if I'm missing some trick in the generated code, let me know, but basically no it's just bad by inspection. This is not a subtle case. – BeeOnRope Jan 19 '17 at 02:43
  • Anyway, I replaced the `*` in the example with `+` so the loop won't be bound by the carried `mul` latency. The result is that the MSVC C++ version runs in ~0.8s, the "good" `icc` version (without any `__builtin_expect`) also in ~0.8s, and the bogus `icc` version in ~1.4s - and that's even being charitable: the compiler really should have unrolled the "good" version at least once so that a couple of `add`s could run in the same cycle. Thanks for keeping me honest anyway: I'll work on updating the question (gah, all the links to fix...). – BeeOnRope Jan 19 '17 at 02:46
  • Here's the [gist](https://gist.github.com/travisdowns/167cbfeddf2551ac3b79490a90e0975b) for my updated version of your test. – BeeOnRope Jan 19 '17 at 02:49
  • @BeeOnRope I don't entirely disagree with you, to be sure. I guess my point is that the directive is probably not optimised for trivial loops like this in the `icc`. I still don't think that generally invalidates the logic of what it is doing - it's just not something that you should blindly paste into a codebase without being sure it's going to cool a hotspot where you're having bad (and correctable) misprediction issues or pipeline thrashing. – J... Jan 19 '17 at 02:53
  • @BeeOnRope Also consider that you're grasping at straws here, clawing a few cycles back by swapping `add` for `imul` and totally gutting any real work inside the loop. Remember that the pain of a misprediction or pipeline thrash doesn't come from the loop instructions themselves, it comes from the mess that's created by the *actual work in the loop* - a misprediction that trashes your cache or flushes an expensive pipeline is really what this is about tweaking. It would be more clear why this isn't so crazy when you start putting more actual work in the loop. – J... Jan 19 '17 at 02:57
  • I'm not gutting anything - the example is real and simple. Replacing `+` with `*` is no less valid, as my choice of `*` in the first place was arbitrary. In fact, the tight dependent multiplication is the unrealistic part: single-cycle latency operations are more common. On _real_ loops the exact same problem exhibits itself, but do you really need to dig through a big hairy loop with lots of code that doesn't help the problem? We all want an MCVE, right? I updated the example just to remove your concern that, since the code executed in almost the same time, `icc` was being clever here. – BeeOnRope Jan 19 '17 at 02:59
  • @BeeOnRope Well, it's bedtime for me now... maybe if I'm bored one day I'll come back to this. I'd actually like to see someone run this through an intel profiler to count the mispredictions for each of the loops. In any case, I'm sure you could come up with an MCVE where the loop body actually has prediction issues and some memory access where the wacky `likely` codegen really does help. Single-cycle latency ops are *not* where you're worried about a misprediction! – J... Jan 19 '17 at 03:03
  • Yes, on `gcc` and `clang` pretty much any trivial use of `likely` and `unlikely` causes the compiler to make the "obvious" transformations (e.g., keeping the most likely path "in line" and moving the unlikely case out of line). The "obvious" transformation here is probably no change at all - in both the "likely" and "unlikely" cases the simple loop already seems best. If you come back to it, just follow the added instructions that culminate in `test edx, edx` - they are nonsense: the asm version of `x = 0; x = b ? 1 : x; if (x) {...}`, which of course is the same as `if (b) {...}`. – BeeOnRope Jan 19 '17 at 03:07
  • ... and don't get me wrong: of course performance is what matters and compilers can do all kind of tricky stuff that at first blush is stupid but turns out to be clever, but I'm sorry: this isn't one of those times. – BeeOnRope Jan 19 '17 at 03:08
  • @BeeOnRope For this specific case, no it isn't one of those times. The simplest answer to your question of `why?`, then, is that Intel probably spent more time working on the clever bits of the compiler than they did from saving devs from themselves. Ask an Intel engineer why `icc` does this and they'd probably just shake their head and turn the question around, asking you why on earth you had the gumption to stick `__builtin_expect` on such a ridiculous loop. After all, `icc` already gave you the 3op ideal codegen in the first place - **before** you went in there and started messing with it! – J... Jan 19 '17 at 03:12
  • Actually I fully expect them to say "huh, looks like a bug in icc codegen, thanks!". If you are really hung up on my loop, which is a trivial example with obviously terrible codegen, here's [one without a loop](https://godbolt.org/g/AurgxP). It is the _first_ thing I tried to mimic _exactly_ how this idiom is _actually_ used in high perf code - to hint to the compiler about if or if/else branch likelyhood (the loop case is less common). It fails in the same way, calculating `test ecx, ecx` in a nonsense way. Note the bug only occurs if `b` is dec'd inside the expect. Remove the `--` to see. – BeeOnRope Jan 19 '17 at 03:21
  • ... also, while we're still pursuing the world's longest comment thread record, I'd like to dispute the idea that "oh, this is a simple case, not worth optimizing", with the implication that somehow `icc` will get the complex cases right. My experience is exactly the opposite. Compilers are usually **very good** at the simple cases, and produce increasingly oddball codegen as complexity grows. You pretty much never see a case where a problem that shows up in a very simple case disappears in a more complex example. It either stays there or multiplies. – BeeOnRope Jan 19 '17 at 03:24