How do the likely/unlikely macros in the Linux kernel work and what is their benefit?

Question

I've been digging through some parts of the Linux kernel, and found calls like this:

if (unlikely(fd < 0))
{
    /* Do something */
}

or

if (likely(!err))
{
    /* Do something */
}

I've found the definition of them:

#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)

I know that they are for optimization, but how do they work? And how much performance/size decrease can be expected from using them? And is it worth the hassle (and losing the portability probably) at least in bottleneck code (in userspace, of course).

This really isn't specific to the Linux kernel or about macros, but a compiler optimization. Should this be retagged to reflect that? — Serafina Brocious, Sep 20 '08 at 23:13
The paper [What every Programmer should know about Memory](http://people.redhat.com/drepper/cpumemory.pdf) (p. 57) contains an in-depth explanation. — Torsten Marek, Sep 20 '08 at 23:15
Related: [a benchmark on the use of `__builtin_expect`](http://stackoverflow.com/a/35940041/5470596) on another question. — YSC, Mar 15 '16 at 14:53
There's no portability issue. You can trivially do things like `#define likely(x) (x)` and `#define unlikely(x) (x)` on platforms that don't support this kind of hinting. — David Schwartz, Apr 19 '16 at 07:25
According to the [kernelnewbies FAQ](http://kernelnewbies.org/FAQ/LikelyUnlikely) (and the latest 3.11 kernel source), the macros definitions are slightly different now: #define likely(x) __builtin_expect(!!(x), 1) #define unlikely(x) __builtin_expect(!!(x), 0) I guess this adds a little more to the confusion!? :) I don't get the need for double NOT (!!). — Mandeep Sandhu, Sep 03 '13 at 10:14
The double negation should be added to @DavidSchwartz’ flavours though. It’s important because people do things like `x=3; if (x) do something;` and this is implicit for `if (x != 0)` or `if (!!x)` (as `if (!x)` is `if (x == 0)`). — mirabilos, Nov 04 '22 at 19:44

score 415 · Accepted Answer · edited Jul 27 '23 at 18:14

415

They are hint to the compiler to emit instructions that will cause branch prediction to favour the "likely" side of a jump instruction. This can be a big win, if the prediction is correct it means that the jump instruction is basically free and will take zero cycles. On the other hand if the prediction is wrong, then it means the processor pipeline needs to be flushed and it can cost several cycles. So long as the prediction is correct most of the time, this will tend to be good for performance.

Like all such performance optimisations you should only do it after extensive profiling to ensure the code really is in a bottleneck, and probably given the micro nature, that it is being run in a tight loop. Generally the Linux developers are pretty experienced so I would imagine they would have done that. They don't really care too much about portability as they only target gcc, and they have a very close idea of the assembly they want it to generate.

Note that most ISAs don't have a way for the machine code to actually hint the hardware branch predictor, other than static prediction (backward taken / forward not-taken) on some. And on modern implementations like x86 since 2013 or so, even that's not a thing anymore:

The likely and unlikely macros or C++ [[likely]] / [[unlikely]] annotations can hint the compiler's branch layout to favour I-cache locality for the fast path, and minimize taken branches on the fast path. Also to hint the decision to make branchy vs. branchless asm when that's possible.

edited Jul 27 '23 at 18:14

Peter Cordes

328,167
45
605
847

answered Sep 20 '08 at 23:09

1800 INFORMATION

131,367
29
160
239

3

These macros mostly were used for error checking. Because error leaves less probably then normal operation. A few people make profiling or calculation to decide most used leaf... – gavenkoa May 04 '12 at 21:12
67

As regards the fragment `"[...]that it is being run in a tight loop"`, many CPUs have a [branch predictor](http://en.wikipedia.org/wiki/Branch_predictor), thus using these macros only helps the first time code is executed or when the history table is overwritten by a different branch with the same index into the branching table. In a tight loop, and assuming a branch goes one way most of the time, the branch predictor will likely begin guessing the correct branch very quickly. - your friend in pedantry. – Ross Rogers Nov 11 '13 at 21:56
17

@RossRogers: What really happens is the compiler arranges the branches so the common case is the not-taken one. This is faster even when branch prediction does work. Taken branches are problematic for instruction-fetch and decode even when they're predicted perfectly. Some CPUs statically predict branches that aren't in their history table, usually with assume not-taken for forward branches. Intel CPUs don't work that way: they don't try to check that the predictor table entry is for *this* branch, they just use it anyway. A hot branch and a cold branch might alias the same entry... – Peter Cordes Feb 15 '16 at 21:41
1

@Peter Cordes - I understand the branch table aliasing, that's why I wrote `the history table is overwritten by a different branch with the same index into the branching table.` I was just pointing out the tight loop thing. If you execute the loop over and over, the initial cost is trivial and the branch predictor takes over, unless you get branch prediction thrashing through jump/calls inside the "tight loop". Telling the the compiler to favor a branch is a micro-optimization in tight loops run many many times. All very pedantic, to be sure :-) – Ross Rogers Feb 16 '16 at 02:46
2

@RossRogers: My main point was that laying out the fast-path with mostly non-taken branches is good, and is a win even after branch predictors warm up (e.g. in a tight loop). – Peter Cordes Feb 16 '16 at 03:08
re: prediction: Intel CPUs for a the last few generations literally don't have any static branch prediction. Instead of a new branch evicting/overwriting an old entry in the BTB, it just uses it with whatever stale data was there before. So a cold branch aliasing a hot branch doesn't lose all the prediction history for the hot branch (just pollutes it a bit). There's no static prediction because the predictor can't tell that it hasn't seen a branch before. [Agner Fog's microarch doc has an early chapter about branch prediction](http://agner.org/optimize/). – Peter Cordes Feb 16 '16 at 03:12
28

This answer is mostly obsolete since the main claim is that it helps branch prediction, and as @PeterCordes points out, in most modern hardware there is no implicit or explicit static branch prediction. In fact the hint is used by the compiler to optimize the code, whether that involves static branch hints, or any other type of optimization. For most architectures today, it is the "any other optimization" that matters, e.g., making hot paths contiguous, better scheduling the hot path, minimizing the size of the slow path, vectorizing only the expected path, etc, etc. – BeeOnRope Jan 18 '17 at 21:54
4

@BeeOnRope because of cache prefetch and word size, there is still an advantage to running a program linearly. The next memory location will already be fetched and in cache, the branch target maybe or maybe not. With a 64 bit CPU you grab at least 64 bits at a time. Depending on DRAM interleave, it may be 2x 3x or more bits that get grabbed. – Bryce Feb 15 '17 at 19:39
3

Absolutely, there are all kind of pipeline-related reasons why linear code is preferred and they have nothing to do with static branch hits embedded in the instruction. Modern CPUs generally ignore those, so the entire rationale given in this answer is obsolete. @Bryce – BeeOnRope Feb 15 '17 at 19:57

score 135 · Answer 2 · edited Apr 13 '20 at 20:33

Let's decompile to see what GCC 4.8 does with it

Without __builtin_expect

#include "stdio.h"
#include "time.h"

int main() {
    /* Use time to prevent it from being optimized away. */
    int i = !time(NULL);
    if (i)
        printf("%d\n", i);
    puts("a");
    return 0;
}

Compile and decompile with GCC 4.8.2 x86_64 Linux:

gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o

Output:

0000000000000000 <main>:
   0:       48 83 ec 08             sub    $0x8,%rsp
   4:       31 ff                   xor    %edi,%edi
   6:       e8 00 00 00 00          callq  b <main+0xb>
                    7: R_X86_64_PC32        time-0x4
   b:       48 85 c0                test   %rax,%rax
   e:       75 14                   jne    24 <main+0x24>
  10:       ba 01 00 00 00          mov    $0x1,%edx
  15:       be 00 00 00 00          mov    $0x0,%esi
                    16: R_X86_64_32 .rodata.str1.1
  1a:       bf 01 00 00 00          mov    $0x1,%edi
  1f:       e8 00 00 00 00          callq  24 <main+0x24>
                    20: R_X86_64_PC32       __printf_chk-0x4
  24:       bf 00 00 00 00          mov    $0x0,%edi
                    25: R_X86_64_32 .rodata.str1.1+0x4
  29:       e8 00 00 00 00          callq  2e <main+0x2e>
                    2a: R_X86_64_PC32       puts-0x4
  2e:       31 c0                   xor    %eax,%eax
  30:       48 83 c4 08             add    $0x8,%rsp
  34:       c3                      retq

The instruction order in memory was unchanged: first the printf and then puts and the retq return.

With __builtin_expect

Now replace if (i) with:

if (__builtin_expect(i, 0))

and we get:

0000000000000000 <main>:
   0:       48 83 ec 08             sub    $0x8,%rsp
   4:       31 ff                   xor    %edi,%edi
   6:       e8 00 00 00 00          callq  b <main+0xb>
                    7: R_X86_64_PC32        time-0x4
   b:       48 85 c0                test   %rax,%rax
   e:       74 11                   je     21 <main+0x21>
  10:       bf 00 00 00 00          mov    $0x0,%edi
                    11: R_X86_64_32 .rodata.str1.1+0x4
  15:       e8 00 00 00 00          callq  1a <main+0x1a>
                    16: R_X86_64_PC32       puts-0x4
  1a:       31 c0                   xor    %eax,%eax
  1c:       48 83 c4 08             add    $0x8,%rsp
  20:       c3                      retq
  21:       ba 01 00 00 00          mov    $0x1,%edx
  26:       be 00 00 00 00          mov    $0x0,%esi
                    27: R_X86_64_32 .rodata.str1.1
  2b:       bf 01 00 00 00          mov    $0x1,%edi
  30:       e8 00 00 00 00          callq  35 <main+0x35>
                    31: R_X86_64_PC32       __printf_chk-0x4
  35:       eb d9                   jmp    10 <main+0x10>

The printf (compiled to __printf_chk) was moved to the very end of the function, after puts and the return to improve branch prediction as mentioned by other answers.

So it is basically the same as:

int main() {
    int i = !time(NULL);
    if (i)
        goto printf;
puts:
    puts("a");
    return 0;
printf:
    printf("%d\n", i);
    goto puts;
}

This optimization was not done with -O0.

But good luck on writing an example that runs faster with __builtin_expect than without, CPUs are really smart these days. My naive attempts are here.

C++20 [[likely]] and [[unlikely]]

C++20 has standardized those C++ built-ins: How to use C++20's likely/unlikely attribute in if-else statement They will likely (a pun!) do the same thing.

score 83 · Answer 3 · edited Oct 19 '18 at 17:55

These are macros that give hints to the compiler about which way a branch may go. The macros expand to GCC specific extensions, if they're available.

GCC uses these to to optimize for branch prediction. For example, if you have something like the following

if (unlikely(x)) {
  dosomething();
}

return x;

Then it can restructure this code to be something more like:

if (!x) {
  return x;
}

dosomething();
return x;

The benefit of this is that when the processor takes a branch the first time, there is significant overhead, because it may have been speculatively loading and executing code further ahead. When it determines it will take the branch, then it has to invalidate that, and start at the branch target.

Most modern processors now have some sort of branch prediction, but that only assists when you've been through the branch before, and the branch is still in the branch prediction cache.

There are a number of other strategies that the compiler and processor can use in these scenarios. You can find more details on how branch predictors work at Wikipedia: http://en.wikipedia.org/wiki/Branch_predictor

Also, it impacts icache footprint - by keeping unlikely snippets of code out of the hot path. — fche, Mar 03 '15 at 01:58
More precisely, it can do it with `goto`s without repeating the `return x`: http://stackoverflow.com/a/31133787/895245 — Ciro Santilli OurBigBook.com, Jun 30 '15 at 09:10

score 11 · Answer 4 · answered Sep 20 '08 at 23:11

11

They cause the compiler to emit the appropriate branch hints where the hardware supports them. This usually just means twiddling a few bits in the instruction opcode, so code size will not change. The CPU will start fetching instructions from the predicted location, and flush the pipeline and start over if that turns out to be wrong when the branch is reached; in the case where the hint is correct, this will make the branch much faster - precisely how much faster will depend on the hardware; and how much this affects the performance of the code will depend on what proportion of the time hint is correct.

For instance, on a PowerPC CPU an unhinted branch might take 16 cycles, a correctly hinted one 8 and an incorrectly hinted one 24. In innermost loops good hinting can make an enormous difference.

Portability isn't really an issue - presumably the definition is in a per-platform header; you can simply define "likely" and "unlikely" to nothing for platforms that do not support static branch hints.

answered Sep 20 '08 at 23:11

moonshadow

86,889
7
82
122

4

For the record, x86 does take additional space for branch hints. You have to have a one-byte prefix on branches to specify the appropriate hint. Agreed that hinting is a Good Thing (TM), though. – Serafina Brocious Sep 20 '08 at 23:16
2

Dang CISC CPUs and their variable-length instructions ;) – moonshadow Sep 20 '08 at 23:26
4

Dang RISC CPUs -- Stay away from my 15-byte instructions ;) – Serafina Brocious Sep 20 '08 at 23:33
9

@CodyBrocious: branch hinting was introduced with P4, but was abandoned along with P4. All other x86 CPUs simply ignore those prefixes (because prefixes are always ignored in contexts where they're meaningless). These macros *don't* cause gcc to actually emit branch-hint prefixes on x86. They do help you get gcc to lay out your function with fewer taken branches on the fast-path. – Peter Cordes Feb 15 '16 at 21:47

score 9 · Answer 5 · edited Jun 16 '23 at 21:53

9

long __builtin_expect(long EXP, long C);

This construct tells the compiler that the expression EXP most likely will have the value C. The return value is EXP. __builtin_expect is meant to be used in an conditional expression. In almost all cases will it be used in the context of boolean expressions in which case it is much more convenient to define two helper macros:

#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)

These macros can then be used as in:

if (likely(a > 1))

Reference

edited Jun 16 '23 at 21:53

user16217248

3,119
19
19
37

answered Nov 23 '16 at 13:22

Ashish Maurya

131
1
4

1

As was asked in a comment to another answer - what is the reason for the double inversion in the macros (i.e. why use `__builtin_expect(!!(expr),0)` instead of just `__builtin_expect((expr),0)`? – Michael Firth Jan 22 '19 at 13:15
4

@MichaelFirth "double inversion" `!!` is equivalent to casting something to a `bool`. Some people like to write it this way. – Ben XO Aug 06 '19 at 13:05

score 5 · Answer 6 · answered Sep 20 '08 at 23:19

5

(general comment - other answers cover the details)

There's no reason that you should lose portability by using them.

You always have the option of creating a simple nil-effect "inline" or macro that will allow you to compile on other platforms with other compilers.

You just won't get the benefit of the optimization if you're on other platforms.

answered Sep 20 '08 at 23:19

Andrew Edgecombe

39,594
3
35
61

2

You don't use portability - the platforms that don't support them just define them to expand to empty strings. – sharptooth Sep 30 '11 at 08:36
2

I think you two are actually agreeing with each other -- it's just phrased confusingly. (From the looks of it, Andrew's comment is saying "you can use them without losing portability" but sharptooth thought that he said "don't use them as they're not portable" and objected.) – Miral May 08 '12 at 00:31

score 3 · Answer 7 · edited May 23 '17 at 12:26

As per the comment by Cody, this has nothing to do with Linux, but is a hint to the compiler. What happens will depend on the architecture and compiler version.

This particular feature in Linux is somewhat mis-used in drivers. As osgx points out in semantics of hot attribute, any hot or cold function called with in a block can automatically hint that the condition is likely or not. For instance, dump_stack() is marked cold so this is redundant,

 if(unlikely(err)) {
     printk("Driver error found. %d\n", err);
     dump_stack();
 }

Future versions of gcc may selectively inline a function based on these hints. There have also been suggestions that it is not boolean, but a score as in most likely, etc. Generally, it should be preferred to use some alternate mechanism like cold. There is no reason to use it in any place but hot paths. What a compiler will do on one architecture can be completely different on another.

score 3 · Answer 8 · edited May 02 '22 at 09:09

In many linux release, you can find compiler.h in /usr/linux/ , you can include it for use simply. And another opinion, unlikely() is more useful rather than likely(), because

if ( likely( ... ) ) {
     doSomething();
}

it can be optimized as well in many compiler.

And by the way, if you want to observe the detail behavior of the code, you can do simply as follow:

gcc -c test.c objdump -d test.o > obj.s

Then, open obj.s, you can find the answer.

score 1 · Answer 9 · answered Sep 20 '08 at 23:07

They're hints to the compiler to generate the hint prefixes on branches. On x86/x64, they take up one byte, so you'll get at most a one-byte increase for each branch. As for performance, it entirely depends on the application -- in most cases, the branch predictor on the processor will ignore them, these days.

Edit: Forgot about one place they can actually really help with. It can allow the compiler to reorder the control-flow graph to reduce the number of branches taken for the 'likely' path. This can have a marked improvement in loops where you're checking multiple exit cases.

gcc never generates x86 branch hints - at least all Intel CPUs would ignore them anyway. It will try to limit code size in unlikely regions by avoiding inlining and loop unrolling, though. — alex strange, Oct 31 '08 at 19:19

score 1 · Answer 10 · answered Sep 20 '08 at 23:08

These are GCC functions for the programmer to give a hint to the compiler about what the most likely branch condition will be in a given expression. This allows the compiler to build the branch instructions so that the most common case takes the fewest number of instructions to execute.

How the branch instructions are built are dependent upon the processor architecture.

How do the likely/unlikely macros in the Linux kernel work and what is their benefit?

10 Answers10

Linked

Related