6

Sample code on Compiler Explorer: https://godbolt.org/g/fPfw4k

I was attempting to use an array of function pointers as a jump table instead of switches as I found it to be cleaner. However, to my surprise, neither GCC nor Clang compiler seems capable of inlining this.

Is there a specific reason why?

Example code incase of dead link:

namespace{
template<int N>
int bar(){
    return N;
}

int foo1(int n){
     if(n < 0 || n > 5){
        __builtin_unreachable();
    }
    #if __clang__
    __builtin_assume(n >= 0 && n <= 5);
    #endif
    static int (* const fns[])() = {
        bar<0>, bar<1>, bar<2>, bar<3>, bar<4>, bar<5>
    };
    return fns[n]();
}

int foo2(int n){
    #if __clang__
    __builtin_assume(n >= 0 && n <= 5);
    #endif
    switch(n){
        case 0:
            return bar<0>();
        case 1:
            return bar<1>();
        case 2:
            return bar<2>();
        case 3:
            return bar<3>();
        case 4:
            return bar<4>();
        case 5:
            return bar<5>();
        default:
            __builtin_unreachable();
    }
}
}

int main(int argc, char** argv){
    volatile int n = foo1(argc);
    volatile int p = foo2(argc);
}

Using the language extension attribute always_inline provided by GCC & Clang makes no difference either.

  • 1
    A lookup table is on average more efficient that 5 test and branch operations on a processor that supports pipelining perhaps? – Richard Hodges Jun 18 '17 at 11:09
  • 1
    It could be my lack of knowledge on this subject, but where would you want the functions to be inlined? You specifically set function pointers to functions which you want to be resolved in runtime. In the switch case, the switch resolves the wanted action and the function is inlined in the case code block. – stefaanv Jun 18 '17 at 11:20
  • @richard hodges, would e.g using a map of ints as key and std:: function be better? – dani Jun 18 '17 at 11:21
  • 1
    @dani no, you'd be trading constant time lookup for logarithmic. You'd be trading down. Lookup tables cost exactly one memory fetch. They're extremely cheap, particularly if the destination code is already in cache. – Richard Hodges Jun 18 '17 at 11:24
  • In this case you could use a vector. – dani Jun 18 '17 at 11:37
  • 1
    @dani both approaches above are equivalent to using a vector. If you look at the optimised assembler output, you'll see this. – Richard Hodges Jun 18 '17 at 11:46
  • If you want your code to be `inlined`, then you would not write that way. You would use more template tricks and `constexpr`. However, since you input is not constant (i.e `argc`), at some point you have to select appropriate code to run. Obviously, if you want to return `argc` anyway, you would write `int n = argc;`. By the way, what is the purpose of `volatile` in your code? **Don't expect the compiler to check all functions and see that it could always returns the value directly.** No one would write such code if that would be the case. – Phil1970 Jun 18 '17 at 12:21
  • Did you look at https://stackoverflow.com/questions/8381293/how-do-i-force-gcc-to-inline-a-function – suroh Jun 18 '17 at 12:51
  • 1
    If you remove the foo1 code, it does become a lookup table. If we would calculate the value of n by adding sufficient constexpr, it gets optimized. So, I suspect there are simply some missing optimization passes or some unexpected ordering. – JVApen Jun 18 '17 at 13:49
  • 1
    I think @JVApen is correct, except for that the code isn't optimized even with `constexpr`s everywhere. What I suspect is happening is that the compiler tries to inline template functions before array accesses are inlined into lookups, and doesn't bother trying to inline template functions again after the array has been inlined. This might be a mistake, or it might be a speed/performance (performance as in quality of the result; the optimized program) tradeoff. – yyny Jul 06 '17 at 12:24

1 Answers1

1

The compiler cannot inline the call in foo1 because the call does not use a compile-time constant callee. If it knows that a constant argument was passed to foo1 at compile time by inlining it, it will inline the correct function.

Consider this example:

namespace{
template<int N>
int bar(){
    return N;
}

int foo1(int n){
     if(n < 0 || n > 5){
        __builtin_unreachable();
    }
    #if __clang__
    __builtin_assume(n >= 0 && n <= 5);
    #endif
    static int (* const fns[])() = {
        bar<0>, bar<1>, bar<2>, bar<3>, bar<4>, bar<5>
    };
    return fns[n]();
}
}

int main(int argc, char** argv){
    int n = foo1(3);

    return n;
}

It is compiled to the following code by both compilers:

main:
        mov     eax, 3
        ret

In the case of foo2, the compiler starts out with 5 different calls with constant callees, all of which it inlines. Then it optimizes the resulting code further, generating its own jump table if it considers it profitable.

I guess the compiler could try to extract a switch from your jump table and then inline everything, but this would be quite complex and very unlikely to yield a performance improvement in the general case, so neither gcc nor clang seems to do this.

PaulR
  • 3,587
  • 14
  • 24