GCC - Function inlining, LTO and optimizations

Question

Given code like this:

#include "kernel.h"
int main() {
    ...
    for (int t = 0; t < TSTEPS; ++t) {
       kernel(A,B,C);
    }
    ...
}

Where:

// kernel.h
void kernel(float *__restrict A, float *__restrict B, float *__restrict C);

// kernel.c
#include "kernel.h"

void kernel(float *__restrict A, float *__restrict B, float *__restrict C) {
    // some invariant code
    float tmp0 = B[42];
    float tmp1 = C[42];
    // some operations with tmpX, e.g.
    A[0] += tmp0 * tmp1;
}

The idea is to compile kernel independently, since I need to apply a set of optimizations to it that I do not want applied to the main program. Besides, I do not want any other loop, inter-procedural or intra-procedural optimizations: I just want to inline exactly the result of compiling kernel at the call to kernel in main. I have tried many different things (giving hints with inline, __attribute__((always_inline)), etc.), but the only way I have found to get it inlined is:

gcc -c -O3 -flto kernel.c
gcc -O1 -flto kernel.o main.c

Producing the following assembly code for kernel:

kernel:
.LFB0:
    .cfi_startproc
    endbr64
    vxorps  %xmm1, %xmm1, %xmm1
    vcvtss2sd   168(%rsi), %xmm1, %xmm0
    vcvtss2sd   168(%rdx), %xmm1, %xmm2
    vcvtss2sd   (%rdi), %xmm1, %xmm1
    vfmadd132sd %xmm2, %xmm1, %xmm0
    vcvtsd2ss   %xmm0, %xmm0, %xmm0
    vmovss  %xmm0, (%rdi)
    ret
    .cfi_endproc

And where the call to kernel should be in main, the generated code is:

...
    1092:   f3 0f 10 0d 76 0f 00    movss  0xf76(%rip),%xmm1        # 2010 <_IO_stdin_used+0x10>
    1099:   00 
    109a:   f3 0f 10 00             movss  (%rax),%xmm0
    109e:   b8 10 27 00 00          mov    $0x2710,%eax
    10a3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    10a8:   f3 0f 58 c1             addss  %xmm1,%xmm0
    10ac:   83 e8 01                sub    $0x1,%eax
    10af:   75 f7                   jne    10a8 <main+0x28>
    10b1:   48 8d 35 4c 0f 00 00    lea    0xf4c(%rip),%rsi        # 2004 <_IO_stdin_used+0x4>
    10b8:   bf 01 00 00 00          mov    $0x1,%edi
    10bd:   b8 01 00 00 00          mov    $0x1,%eax
    10c2:   f3 0f 5a c0             cvtss2sd %xmm0,%xmm0
...

This is clever, of course, and probably the point of LTO. Nonetheless, I would like to avoid any further optimization and only inline those independently compiled functions. Is there any "formal" way of doing this besides writing it by hand? Compiling main with -O0 does not inline at all, not even with -finline-functions. I have also tried negating all the optimization flags enabled by -O1, but I am not able to turn off link-time optimizations. These results were obtained with both gcc 9.3.1 and gcc 10.2.0 (with minor differences between them for this test).

EDIT 0:

Two more details:

With ICC using a similar approach (IPO, inlining flags, etc.), I obtain similar results, i.e., inlining + optimizations. I have not tried Clang yet.
In the code above, inlining kernel into main basically eliminates the loads of tmp0 and tmp1 and just adds the result of their multiplication to A[0]. I am aware that this is clever, but I do not want it: I want to keep the original form of the code.

What real problem are you trying to solve here? Benchmarking? Normally nobody wants worse asm that isn't optimized for the call-site / args, and there isn't a way to just make GCC do what you're asking. So this seems to be an XY problem, so what do you really want? — Peter Cordes, Mar 13 '21 at 05:29
Also, `cvtss2sd 168(%rdx), %xmm1` doesn't seem to match your source; your function args are `double*` but GCC is emitting float->double conversion instructions. This looks like you had all the args being `float*`, but doing math on `double` temporaries. (And then you fixed that in your source but didn't update the asm.) — Peter Cordes, Mar 13 '21 at 05:31
@PeterCordes yes, indeed, I am trying to benchmark some codes, basically. Do not get me wrong, I am aware that nobody wants worse performance, I was just asking if there is a way to control how optimizations are applied. And yes, I forgot to update the C code, thanks. — horro, Mar 13 '21 at 14:13

score 4 · Answer 1 · answered Mar 12 '21 at 19:01

Inlining usually happens at the IR (Intermediate Representation) or bytecode level. That means it is performed on an abstract representation of the source code that is, to a certain degree, machine-independent. It is then followed by other optimization passes, which take advantage of the code having been inlined. This is one of the major benefits of inlining.

Inlining at the assembly level, without any optimizations and, even more so, keeping the function body (assembly) exactly the way it is, would be rather awkward due to register allocation and stack management concerns. It might still be slightly beneficial (due to removal of the call, and possibly due to the register allocator having additional information on the registers used, making it less likely to allocate non-volatile registers), but it is highly unlikely that any compiler has an option to do it this way. It would require a special inlining pass that would happen literally in the backend (due to the requirement to keep the assembly as is).

What you could do: if you really want kernel to be exactly a certain way in assembly, write your kernel function in assembly (or, as an option, inline assembly). If your problem is really something else (such as the compiler optimizing away a computation or a load where you don't want it to), there may be other solutions to that.

This is a good explanation. I was not aware of those very important details. Do you have any sources available to delve deeper into this topic? — horro, Mar 13 '21 at 14:25

score 3 · Accepted Answer · answered Mar 13 '21 at 15:45

There's no option to make GCC do what you want; that wouldn't be useful for performance of real programs. (Only possibly for benchmarking.)

If you want the inlined version to optimize about the same as the stand-alone version, you need to defeat any constant-propagation into args, and stuff like that. Perhaps hide things from the compiler by storing them into volatile local vars and passing those to the function.
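The volatile-locals idea can be sketched as follows. This is only an illustration under assumptions: kernel is copied from the question, while run_kernel_volatile and its tsteps parameter are hypothetical names that do not appear in the original code.

```c
/* Sketch only: kernel() is the function from the question;
   run_kernel_volatile() and tsteps are hypothetical. */
void kernel(float *__restrict A, float *__restrict B, float *__restrict C)
{
    float tmp0 = B[42];
    float tmp1 = C[42];
    A[0] += tmp0 * tmp1;
}

void run_kernel_volatile(float *A, float *B, float *C, int tsteps)
{
    /* The pointer objects themselves are volatile, so every read of
       vA/vB/vC is a real load whose result the compiler cannot predict. */
    float *volatile vA = A;
    float *volatile vB = B;
    float *volatile vC = C;

    for (int t = 0; t < tsteps; ++t) {
        /* The three pointers are reloaded on every iteration, which should
           keep the compiler from propagating known argument values into
           an inlined kernel(). */
        kernel(vA, vB, vC);
    }
}
```

The cost is visible in the loop: three extra loads per call, which is the overhead the next paragraph refers to.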

That doesn't guarantee identical asm, but it should be similar enough for benchmarking purposes. Of course if you want to do this inside another loop, volatile would mean extra loads from memory. So you might just want inline asm like asm("" : "+g"(var)) to make the compiler forget anything it knows about the variable's value, and materialize the value in a register or memory of the compiler's choice. (With clang, probably pick "+r" because it likes to use memory for no reason.)
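A minimal sketch of the empty-asm trick, assuming GNU C (GCC or Clang). HIDE_VALUE and run_kernel_hidden are hypothetical names introduced for illustration. The sketch uses the "+r" constraint mentioned above rather than "+g".

```c
/* GNU C only. The asm template is empty, so no instruction is emitted.
   The "+r" read-write operand makes the compiler put var in a register
   and assume the asm may have changed it. */
#define HIDE_VALUE(var) __asm__("" : "+r"(var))

/* kernel() as in the question. */
void kernel(float *__restrict A, float *__restrict B, float *__restrict C)
{
    float tmp0 = B[42];
    float tmp1 = C[42];
    A[0] += tmp0 * tmp1;
}

/* Hypothetical driver standing in for the loop in main(). */
void run_kernel_hidden(float *A, float *B, float *C, int tsteps)
{
    for (int t = 0; t < tsteps; ++t) {
        /* After these statements the compiler no longer knows where the
           pointers point, without any store/reload through memory. */
        HIDE_VALUE(A);
        HIDE_VALUE(B);
        HIDE_VALUE(C);
        kernel(A, B, C);
    }
}
```

Because each statement both reads and writes its operand, the value produced in one iteration feeds the next, so the statements should stay inside the loop instead of being treated as loop-invariant.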

This may not stop the compiler from hoisting loop-invariant work out of the loop after inlining, though. To defeat that, you may need similar DoNotOptimize escapes or asm volatile stuff inside the function itself to let it inline without defeating the benchmark. (call/ret are really pretty cheap, so it's not unreasonable to try just not letting it inline, although that can create more overhead at the callsite, and it might need to save/restore some registers.)

Or just construct a test-case that realistically reflects your real use-case, including what surrounding code out-of-order execution can overlap this with.
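The two fallbacks from the paragraph on hoisting (escapes inside the function itself, or simply not letting it inline) might look like the sketch below. It assumes GNU C. The escape helper follows the general shape of the "escape" idiom referenced in the comments, and kernel_escaped and kernel_noinline are hypothetical variants of the question's kernel, not code from the original post.

```c
/* GNU C only. Empty asm volatile with a "memory" clobber: the compiler
   must assume the memory reachable from p may have been read or written,
   so it should not move loads or stores across this point. */
static inline void escape(void *p)
{
    __asm__ volatile("" : : "g"(p) : "memory");
}

/* Variant 1: may be inlined, but the escapes should keep the loads and
   the store inside the caller's loop after inlining. */
void kernel_escaped(float *__restrict A, float *__restrict B,
                    float *__restrict C)
{
    escape(B);
    escape(C);
    float tmp0 = B[42];
    float tmp1 = C[42];
    A[0] += tmp0 * tmp1;
    escape(A);
}

/* Variant 2: keep the body untouched and forbid inlining instead,
   accepting the call/ret and call-site overhead. */
__attribute__((noinline))
void kernel_noinline(float *__restrict A, float *__restrict B,
                     float *__restrict C)
{
    float tmp0 = B[42];
    float tmp1 = C[42];
    A[0] += tmp0 * tmp1;
}
```

Neither variant guarantees the same asm as the separately compiled kernel. Variant 1 changes the source of the function being measured, and variant 2 measures the function plus the call overhead.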

`asm("" : "+g"(var))` is a very good "hack" in my case. I would never ever get to that. Awesome. — horro, Mar 15 '21 at 09:18
@horro: note that if you want to force the compiler to materialize the var (in a register) even if it's not used later, you'd want `asm volatile`. (And to force it to materialize the value *without* telling the compiler that your asm rewrites it, `asm volatile("" :: "r"(var))` like some definitions of DoNotOptimize functions use. [this](//stackoverflow.com/q/44562871) or [I don't understand the definition of DoNotOptimizeAway](//stackoverflow.com/q/52203710)). The portable equivalent is assigning to a `volatile int foo` and re-reading from that, but that causes store-forwarding latency. — Peter Cordes, Mar 15 '21 at 10:30
Also ["Escape" and "Clobber" equivalent in MSVC](https://stackoverflow.com/q/33975479) has the working GNU C versions, and links a good CppCon talk by Chandler Carruth (a clang developer) about microbenchmarking with `perf` which demos how you use them. — Peter Cordes, Mar 15 '21 at 10:31
