0

I was curious about how switch jumps perform relative to function calls so I whipped out a quick benchmark:

#!/bin/bash -eu
# Benchmark: dispatch through a switch on the "function pointer" value
# vs. an actual indirect call, comparing a statically linked get()
# against the same get() living in a shared library (PLT-mediated call).

# get.c: get() returns a compile-time random constant when the pointer
# value is a small integer (0..5), otherwise calls through the pointer.
# $RANDOM is expanded by the shell when the heredoc is written, so each
# case body gets a distinct constant.
cat > get.c <<EOF
#include <stdint.h>
int get(int (Getter)(void))
{
    uintptr_t getter=(uintptr_t)Getter; 
    if(1){
        switch(getter){
        case 0: return $RANDOM;
        case 1: return $RANDOM;
        case 2: return $RANDOM;
        case 3: return $RANDOM;
        case 4: return $RANDOM;
        case 5: return $RANDOM;
        default: return Getter();
        }
    }else{
        if(0==getter) return $RANDOM;
        else if(1==getter) return $RANDOM;
        else if(2==getter) return $RANDOM;
        else if(3==getter) return $RANDOM;
        else if(4==getter) return $RANDOM;
        else if(5==getter) return $RANDOM;
        else return Getter();
    }
}
EOF
# main.c: with no argument, exercise the switch path (pointer value 4);
# with any argument, exercise the real indirect-call path.
cat > main.c <<EOF
int get(int (Getter)(void));
int Getter(void){ return 42; }
int main(int C, char**V)
{
    if(C==1)
        for(int i=0; i<1000000000;i++)
            get((int(*)(void))4);
    else
        for(int i=0; i<1000000000;i++)
            get(Getter);

}
EOF
: ${CC:=gcc}
arg='-Os -fpic'
for c in *.c; do $CC $arg -c $c; done
# Dynamic variant: get() lives in a shared library, called via the PLT.
$CC get.o -o libget.so -shared
$CC main.o $PWD/libget.so -o dso
# Static variant: get() linked directly into the executable.
# (Fixed: the original line was "$CC main.o get.o -o dso -o static" —
# a duplicate -o; only the last one takes effect, so it happened to
# work, but the stray "-o dso" was a typo.)
$CC main.o get.o -o static
set -x
time ./dso
time ./dso 1
time ./static
time ./static 1

The timings (relatively stable) are:

+ ./dso

real    0m3.778s
user    0m3.709s
sys 0m0.056s
+ ./dso 1

real    0m3.739s
user    0m3.736s
sys 0m0.000s
+ ./static

real    0m2.478s
user    0m2.477s
sys 0m0.000s
+ ./static 1

real    0m3.425s
user    0m3.411s
sys 0m0.000s

Why do switch jumps perform quite a bit better but only when the function is linked statically?

Disassembly diff (sdiff-generated) of the dynamic and static version respectively:

000000000000111a <get>:                       | 0000000000001180 <get>:
    cmp    $0xc,%rdi                        cmp    $0xc,%rdi
    ja     1178 <get+0x5e>                    |     ja     11de <get+0x5e>
    lea    0xed9(%rip),%rdx        # 2000 <_fini+0xe80>   |     lea    0xe77(%rip),%rdx        # 2004 <_IO_stdin_used
    movslq (%rdx,%rdi,4),%rax                   movslq (%rdx,%rdi,4),%rax
    add    %rdx,%rax                        add    %rdx,%rax
    jmpq   *%rax                            jmpq   *%rax
    mov    $0x132b,%eax                     mov    $0x132b,%eax
    retq                                retq   
    mov    $0x2740,%eax                     mov    $0x2740,%eax
    retq                                retq   
    mov    $0x79b6,%eax                     mov    $0x79b6,%eax
    retq                                retq   
    mov    $0x5234,%eax                     mov    $0x5234,%eax
    retq                                retq   
    mov    $0x6389,%eax                     mov    $0x6389,%eax
    retq                                retq   
    mov    $0x37de,%eax                     mov    $0x37de,%eax
    retq                                retq   
    mov    $0x6a22,%eax                     mov    $0x6a22,%eax
    retq                                retq   
    mov    $0x1a35,%eax                     mov    $0x1a35,%eax
    retq                                retq   
    mov    $0x2ce8,%eax                     mov    $0x2ce8,%eax
    retq                                retq   
    mov    $0x4fed,%eax                     mov    $0x4fed,%eax
    retq                                retq   
    mov    $0xfe3,%eax                      mov    $0xfe3,%eax
    retq                                retq   
    mov    $0x4229,%eax                     mov    $0x4229,%eax
    retq                                retq   
    jmpq   *%rdi                            jmpq   *%rdi
    mov    $0x529e,%eax                     mov    $0x529e,%eax
    retq                                retq   
                                  <
Petr Skocik
  • 58,047
  • 6
  • 95
  • 142
  • my guess is it's all about cache locality https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code and https://en.wikipedia.org/wiki/Locality_of_reference – Sergei Nikulov Dec 28 '18 at 12:53
  • My results, using Apple LLVM 10.0.0 clang-1000.11.45.5, show faster results for the `switch` code in both cases. Why would you think there is a general answer not dependent on your compiler? You have not stated what tools you are using. State the compiler and linker versions, use `-S` to generate assembly code, and show the assembly code. – Eric Postpischil Dec 28 '18 at 12:55
  • @SergeiNikulov Shouldn't the loop ensure everything's in cache? Couldn't it be in the distance of the jump? – Petr Skocik Dec 28 '18 at 12:56
  • @EricPostpischil Thanks. I don't think that. I was even going to tag it [assembly]. I'll do it now. – Petr Skocik Dec 28 '18 at 12:57
  • When doing benchmarking, always build *with* optimization. I.e. use something like `-O2` instead of `-Os`. – Some programmer dude Dec 28 '18 at 12:59
  • Also, aren't you breaking strict aliasing here (with the argument to the `get` function)? Which technically is UB. – Some programmer dude Dec 28 '18 at 13:05
  • @PSkocik one of the statements "Avoid unpredictable branches". Switch definitely predictable structure and can be optimized to table. – Sergei Nikulov Dec 28 '18 at 13:07
  • @Someprogrammerdude I'm not calling the casted integer literals so I don't think I am, but I'm not even caring about perfect C conformance here, just the machine behavior. (I believe it's technically UB for a different reason -- the pointers aren't sufficiently aligned for the target type. But again, I'm only asking about the behavior of the machine code.) – Petr Skocik Dec 28 '18 at 13:07
  • 1
    why he is speaking about a _switch_ while the difference only concern the fact the called function is linked statically or dynamically ? The _switch_ is not relevant here – bruno Dec 28 '18 at 13:09
  • @PSkocik indeed, can you prove that it is the switch - doesn't the same thing happen if the function is constant? – Antti Haapala -- Слава Україні Dec 28 '18 at 13:38
  • 1
  • @Someprogrammerdude: I think the compiler is "compressing" the table (storing offsets and adding at runtime instead storing full pointers) not because of `-Os`, but because it's trying to ensure position-independent code+data. (A static jump table would have absolute addresses hardcoded in it.) I don't think that's a significant source of extra cost here, just a couple cycles or so of extra branch-miss cost before the right address can be found, if a branch mispredicts. Moreover, the asm is the same for both versions. Try with `-fno-pie -no-pie`. – Peter Cordes Dec 28 '18 at 13:39
  • There's a huge missed optimization here: the switch is still implemented as a branch instead of just a table-lookup of the data. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585 is another example of gcc missing this optimization, with the slight added complication of string-literal return values. Integers should be even easier to optimize into a table. The cases are contiguous so it would only take one range-check before a table lookup of a return value. – Peter Cordes Dec 28 '18 at 13:44
  • @bruno anyway it's about the locality. Either you've called something which is in a .text segment or somewhere physically distanced (even if it is virtually seen as local). – Sergei Nikulov Dec 28 '18 at 13:44
  • @AnttiHaapala If you replace the switch with `if(getter==4) return $RANDOM; else return Getter();` you get about the same timing. It is the switch. (If it was dynamic call overhead, I'd expect approx. a constant addition to each timing with the `switch` still winning). – Petr Skocik Dec 28 '18 at 13:45
  • 1
    @PSkocik then I am guessing that you're branching too deep and spilling items from the branch prediction table or the pipeline stalls or sth. I don't think it is the cache, pretty sure all the code would fit in the innermost cache anyway. i7 for example has got 32k memory in L1. – Antti Haapala -- Слава Україні Dec 28 '18 at 13:49
  • @PeterCordes That's a good point but I'm really interested in the cost of the switch-jump vs a function call here. The `return $RANDOM;`s are just placeholders to real short code that wouldn't be optimizable to a lookup table. If it didn't generate a jump on gcc, I'd be asking about a different example that does generate a jump. – Petr Skocik Dec 28 '18 at 13:52

1 Answer

3

The calls can't inline (because you put the definition in a separate file and didn't use link-time optimization).

I think you're measuring the extra overhead of calling through the PLT when calling a function in a shared library, traditional Unix style, which gcc does by default. Use -fno-plt to emit memory-indirect call instructions that use the GOT entry directly, instead of calling a memory-indirect jmp. See Sorry state of dynamic libraries on Linux for more about PLT overhead, or disassemble it yourself. (TODO: add disassembly to this answer.)

I expect -fno-plt will make both versions run almost exactly the same.


The asm for both versions of "get" is identical, modulo different random numbers and different addresses. They presumably perform the same, both slow because gcc misses the optimization of turning the switch into a table lookup. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585 for that and related stuff. (BTW, gcc compresses the table into offsets instead of using a classic jump table of raw pointers because it's trying to avoid absolute addresses everywhere, even as data. Some targets don't support fixups even for that, and gcc currently avoids them even on targets like x86-64/Linux where it would be fine with runtime fixups. But of course it's silly to do an indirect branch instead of just looking up the data in a table in this case.)

Also related: 32-bit absolute addresses no longer allowed in x86-64 Linux? talks some about the cost of -fpie and -fpic. In this case there's nothing to save by omitting -fpic and/or using -fno-pie -no-pie, because separate files are also keeping the function from inlining, not just possible symbol-interposition / ELF symbol visibility.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • You just guessed it without even trying it out? :D Yes. `-fno-plt` is making both versions run almost exactly the same. Thanks. – Petr Skocik Dec 28 '18 at 14:01
  • 1
    @PSkocik: yeah, I know what `-fno-plt` does to the asm, and that a memory-indirect `call get@GOTPCREL(%rip)` will perform identically to a direct `call rel32` in a microbenchmark where the branch predictors are always hot, on modern Intel and AMD CPUs. – Peter Cordes Dec 28 '18 at 14:09
  • 1
    @PSkocik: BTW, your question title doesn't make sense. You're not comparing switch-jump vs. a function call. You have the same amount of function calls in both cases. You may have written this to compare switch vs. an if chain or tree (again which isn't a function call), but now you're comparing identical code with the only difference being call within an executable (but across files) vs. calling a function in a shared library. Using LTO `gcc -O3 -flto` to compile and link will make the single-executable version much faster because get can inline instead of actually calling. – Peter Cordes Dec 28 '18 at 14:13
  • I don't quite understand why the PLT indirection is slowing down the switch cases without the call to the function pointer. Shouldn't it only slow down the `default:` case thus making the first version with the DSO perform even more favorably relative to the second? (Perhaps I'll understand after I've gone through the hyperlinks.) – Petr Skocik Dec 28 '18 at 14:17
  • 1
    @PSkocik: It's slowing down the call from the loop in `main` to `get`. Either way of compiling, the caller passes the real function pointer (not a pointer to a PLT stub) as an argument, so a call from inside the switch doesn't also go through the PLT. – Peter Cordes Dec 28 '18 at 14:31
  • With the direct call to get, `get(4)` is faster than `get(Getter)`. What I don't understand is why with the PLT indirection the `get(4)` and `get(Getter)` take about the same. Shouldn't each duration increase like by a constant, maintaining `get(4)` as the winner? – Petr Skocik Dec 28 '18 at 14:37
  • 1
    @PSkocik: I'll have to sleep on that and maybe try it myself, not sure why that's the case. If you can check perf counters to see if there are any branch misses, that'd be a good idea. (`perf stat ./static 1` and so on.) – Peter Cordes Dec 28 '18 at 14:41