Why does using extern inline __attribute__((gnu_inline)) instead of static inline affect GCC 8.3 code generation so much?

The example code is based on the glibc bsearch code (built with -O3):

#include <stddef.h>

extern inline __attribute__((gnu_inline))
void *bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
   int (*__compar)(const void *, const void *))
{
    size_t __l, __u, __idx;
    const void *__p;
    int __comparison;

    __l = 0;
    __u = __nmemb;
    while (__l < __u) {
        __idx = (__l + __u) / 2;
        __p = (void *) (((const char *) __base) + (__idx * __size));
        __comparison = (*__compar) (__key, __p);
        if (__comparison < 0)
            __u = __idx;
        else if (__comparison > 0)
            __l = __idx + 1;
        else
            return (void *) __p;
    }

    return NULL;
}

static int comp_int(const void *a, const void *b)
{
    int l = *(const int *) a, r = *(const int *) b;
    if (l > r) return 1;
    else if (l < r) return -1;
    else return 0;
}

int *bsearch_int(int key, const int *data, size_t num)
{
    return bsearch(&key, data, num, sizeof(int), &comp_int);
}

The code generated for the bsearch_int function is:

bsearch_int:
        test    rdx, rdx
        je      .L6
        xor     r8d, r8d
.L5:
        lea     rcx, [rdx+r8]
        shr     rcx
        lea     rax, [rsi+rcx*4]
        cmp     DWORD PTR [rax], edi
        jl      .L3
        jg      .L10
        ret
.L10:
        mov     rdx, rcx
.L4:
        cmp     rdx, r8
        ja      .L5
.L6:
        xor     eax, eax
        ret
.L3:
        lea     r8, [rcx+1]
        jmp     .L4

If I use static inline instead of extern inline __attribute__((gnu_inline)), I get much larger code:

bsearch_int:
        xor     r8d, r8d
        test    rdx, rdx
        je      .L11
.L2:
        lea     rcx, [r8+rdx]
        shr     rcx
        lea     rax, [rsi+rcx*4]
        cmp     edi, DWORD PTR [rax]
        jg      .L7
        jl      .L17
.L1:
        ret
.L17:
        cmp     r8, rcx
        jnb     .L11
        lea     rdx, [r8+rcx]
        shr     rdx
        lea     rax, [rsi+rdx*4]
        cmp     edi, DWORD PTR [rax]
        jg      .L12
        jge     .L1
        cmp     r8, rdx
        jnb     .L11
.L6:
        lea     rcx, [r8+rdx]
        shr     rcx
        lea     rax, [rsi+rcx*4]
        cmp     DWORD PTR [rax], edi
        jl      .L7
        jle     .L1
        mov     rdx, rcx
        cmp     r8, rdx
        jb      .L6
.L11:
        xor     eax, eax
.L18:
        ret
.L12:
        mov     rax, rcx
        mov     rcx, rdx
        mov     rdx, rax
.L7:
        lea     r8, [rcx+1]
        cmp     r8, rdx
        jb      .L2
        xor     eax, eax
        jmp     .L18

What makes GCC generate so much shorter code in the first case?

Notes:

  • Clang does not seem to be affected by this.
  • For a semantics-level guide to inlining cases, see also https://stackoverflow.com/questions/216510/extern-inline/51229603#51229603 – o11c Apr 27 '19 at 21:30

2 Answers

It only compiles because you do not use any optimisations and inlining is not active. Try with -O1 for example and your code will not compile at all.

The code is different because when you use static, the compiler does not have to care about calling conventions, as the function will not be visible to other compilation units.

0___________
  • I compile with `-O3`, and inlining works like a charm. That is also why the compiler doesn't need to care about the calling convention anyway: the only function in the output is `bsearch_int`, which has no GCC-specific attributes. And the code compiles fine at all optimization levels. I moved the link to godbolt.org closer to the start of the question to make this clearer. –  Apr 27 '19 at 20:49

The answer below was based on revision 2 of the question. Revision 3, made in response to this answer, changed the meaning of the question, so much of what follows can seem a bit out of context. I am leaving this answer as it was written, based on revision 2.


From 6.31.1 Common Function Attributes of GCC's manual [emphasis mine]:

gnu_inline

This attribute should be used with a function that is also declared with the inline keyword. It directs GCC to treat the function as if it were defined in gnu90 mode even when compiling in C99 or gnu99 mode.

...

And, from Section 6.42 An Inline Function is As Fast As a Macro [emphasis mine]:

When a function is both inline and static, if all calls to the function are integrated into the caller, and the function's address is never used, then the function's own assembler code is never referenced. In this case, GCC does not actually output assembler code for the function, unless you specify the option -fkeep-inline-functions.

...

The remainder of this section is specific to GNU C90 inlining.

When an inline function is not static, then the compiler must assume that there may be calls from other source files; since a global symbol can be defined only once in any program, the function must not be defined in the other source files, so the calls therein cannot be integrated. Therefore, a non-static inline function is always compiled on its own in the usual fashion.

If you specify both inline and extern in the function definition, then the definition is used only for inlining. In no case is the function compiled on its own, not even if you refer to its address explicitly. Such an address becomes an external reference, as if you had only declared the function, and had not defined it.

...
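The quoted rules can be illustrated with a minimal sketch. The helper names (twice_static, twice_plain, call_through, twice_both) are hypothetical and do not come from the question or from the GCC manual; the sketch only assumes a compiler that supports the gnu_inline attribute, such as GCC.

```c
/* static inline: an out-of-line copy is emitted only if something still
   needs it (a call that was not inlined, or the address being taken). */
static inline int twice_static(int x)
{
    return 2 * x;
}

/* inline + gnu_inline without extern: GNU C90 rules apply, so the function
   is also compiled on its own as an ordinary global symbol. */
inline __attribute__((gnu_inline)) int twice_plain(int x)
{
    return 2 * x;
}

/* Calling through a pointer takes the address of the callee. */
int call_through(int (*fn)(int), int x)
{
    return fn(x);
}

/* Uses both helpers by address, so both must be available out of line
   unless the optimizer inlines everything. */
int twice_both(int x)
{
    return call_through(twice_static, x) + call_through(twice_plain, x);
}
```

Changing twice_plain to extern inline __attribute__((gnu_inline)) would, per the text quoted above, turn its address into an external reference, so the program would then need a definition of twice_plain in another source file to link.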

The key here is that the gnu_inline attribute only has an effect in the following two cases, where GNU C90 inlining applies:

  • using both extern and inline, and
  • only using inline.

As expected, we see a large difference in the generated assembly between these two.

When using static and inline, however, the GNU C90 inlining rules do not apply (or rather, do not specifically cover this case), which means the gnu_inline attribute does not matter.

Indeed, these two signatures result in the same assembly:

static inline __attribute__((gnu_inline))
void *bsearch ...

static inline
void *bsearch ...

As extern inline and static inline use two different inlining approaches (the GNU C90 inlining strategy and more modern inlining strategies, respectively), it can be expected that the generated assembly may differ slightly between the two. Nonetheless, both of these yield substantially less assembly output than when using only inline (in which case, as cited above, the function is always compiled on its own).
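One way to compare the two variants from a single file is sketched below. The USE_GNU_INLINE macro, the SEARCH_LINKAGE macro, and the demo_ names are assumptions introduced here for illustration; the search loop follows the code in the question.

```c
#include <stddef.h>

/* Hypothetical switch, not part of the question: build with
   -DUSE_GNU_INLINE to select the extern inline variant. */
#ifdef USE_GNU_INLINE
/* No out-of-line copy is emitted in this mode; a call that is not inlined
   (for example at -O0) needs demo_bsearch defined in another source file. */
#define SEARCH_LINKAGE extern inline __attribute__((gnu_inline))
#else
#define SEARCH_LINKAGE static inline
#endif

SEARCH_LINKAGE
void *demo_bsearch(const void *key, const void *base, size_t nmemb,
                   size_t size, int (*compar)(const void *, const void *))
{
    size_t lo = 0, hi = nmemb;

    while (lo < hi) {
        size_t idx = (lo + hi) / 2;
        const void *p = (const char *) base + idx * size;
        int cmp = compar(key, p);

        if (cmp < 0)
            hi = idx;          /* key is in the lower half */
        else if (cmp > 0)
            lo = idx + 1;      /* key is in the upper half */
        else
            return (void *) p; /* found */
    }
    return NULL;
}

/* Three-way comparison of two ints, as in the question. */
static int comp_int(const void *a, const void *b)
{
    int l = *(const int *) a, r = *(const int *) b;
    return (l > r) - (l < r);
}

int *demo_bsearch_int(int key, const int *data, size_t num)
{
    return demo_bsearch(&key, data, num, sizeof(int), comp_int);
}
```

Compiling this once with gcc -O3 -S and once with gcc -O3 -S -DUSE_GNU_INLINE should give two assembly listings for demo_bsearch_int that can be diffed directly; whether they differ as in the question depends on the compiler version.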

dfrib
  • In neither of those cases is the function body generated; the only function in the output is `bsearch_int`. Are you implying that there is a separate code generator for GNU C90 inlining and that it somehow conveys more information to the optimizer? I expected the output to be the same because the optimizer would do the same job anyway and minor differences usually do not matter (although in my case the Clang output changes noticeably even if a variable declaration is moved to the outer scope, so I can expect random changes in the output). –  Apr 27 '19 at 21:04
  • @StaceyGirl possibly, yes. Note that, peculiarly, identical assembly is generated for the `static` case if we remove `inline __attribute__((gnu_inline))` and mark `bsearch` only as `static`. This leads me to believe that there are two totally different optimization schemes here: one where the compiler makes use of our information (`extern inline __attribute__((gnu_inline))`) combined with GNU C90 inlining, and one where it entirely ignores our `inline` hint (`static inline` and `static` yield the same result, so the optimizer tries to figure things out itself rather than following our cue). – dfrib Apr 27 '19 at 21:07
  • I rephrased the question to make it clear that I am interested in the inlining procedure here. Let's see if somebody can provide more information on why this happens. –  Apr 27 '19 at 21:14
  • @StaceyGirl Note that your edit somewhat changes the meaning of the whole post (after answers have already been given), and most of my answer here is now moot. But I'm also interested, so let's see if someone can provide more details. – dfrib Apr 27 '19 at 21:17
  • Yeah, sorry for that. I thought I was clear enough the first time. –  Apr 27 '19 at 21:22