I have code like this:
#include "kernel.h"
int main() {
...
for (int t = 0; t < TSTEPS; ++t) {
kernel(A,B,C);
}
...
}
Where:
// kernel.h
void kernel(float *__restrict A, float *__restrict B, float *__restrict C);
// kernel.c
#include "kernel.h"
void kernel(float *__restrict A, float *__restrict B, float *__restrict C) {
// some invariant code
float tmp0 = B[42];
float tmp1 = C[42];
// some operations with tmpX, e.g.
A[0] += tmp0 * tmp1;
}
The idea is to compile kernel independently, since I need to apply a set of optimizations to it that I do not want applied to the main program. Besides, I do not want any other kind of loop or inter-/intra-procedural optimization: I just want to inline exactly the result of compiling kernel at the call to kernel in main. I have tried many different things (giving hints with inline, __attribute__((always_inline)), etc.), but the only way I have found to get it inlined is:
gcc -c -O3 -flto kernel.c
gcc -O1 -flto kernel.o main.c
This produces the following assembly code for kernel:
kernel:
.LFB0:
.cfi_startproc
endbr64
vxorps %xmm1, %xmm1, %xmm1
vcvtss2sd 168(%rsi), %xmm1, %xmm0
vcvtss2sd 168(%rdx), %xmm1, %xmm2
vcvtss2sd (%rdi), %xmm1, %xmm1
vfmadd132sd %xmm2, %xmm1, %xmm0
vcvtsd2ss %xmm0, %xmm0, %xmm0
vmovss %xmm0, (%rdi)
ret
.cfi_endproc
And where the call to kernel should be in main, the generated code is:
...
1092: f3 0f 10 0d 76 0f 00 movss 0xf76(%rip),%xmm1 # 2010 <_IO_stdin_used+0x10>
1099: 00
109a: f3 0f 10 00 movss (%rax),%xmm0
109e: b8 10 27 00 00 mov $0x2710,%eax
10a3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
10a8: f3 0f 58 c1 addss %xmm1,%xmm0
10ac: 83 e8 01 sub $0x1,%eax
10af: 75 f7 jne 10a8 <main+0x28>
10b1: 48 8d 35 4c 0f 00 00 lea 0xf4c(%rip),%rsi # 2004 <_IO_stdin_used+0x4>
10b8: bf 01 00 00 00 mov $0x1,%edi
10bd: b8 01 00 00 00 mov $0x1,%eax
10c2: f3 0f 5a c0 cvtss2sd %xmm0,%xmm0
...
This is clever, of course, and probably the point of LTO. Nonetheless, I would like to avoid any further optimization and only inline those independently compiled functions. Is there any "formal" way of doing this besides writing it by hand? Compiling main with -O0 does not inline at all, not even with -finline-functions. I have also tried negating all the optimization flags introduced by -O1, but I am not able to turn off link-time optimizations. These results are obtained with both gcc 9.3.1 and gcc 10.2.0 (with minor differences between them for this test).
EDIT 0:
Two more details:
- With ICC using a similar approach (IPO, inlining flags, etc.), I obtain similar results, i.e., inlining + optimizations. I have not tried Clang yet.
- In the code above, inlining kernel into main basically elides the loads of tmp0 and tmp1 and just adds the result of their multiplication to A[0]; I am aware that this is clever, but I do not want it: I want to keep the original form of the code.
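To make the goal concrete, here is a minimal hand-written sketch of the loop shape I would like to end up with in main, assuming the body of kernel is simply pasted at the call site. The function name run_hand_inlined and its tsteps parameter are hypothetical, introduced only for illustration; they are not part of the real program.

```c
/* Hypothetical hand-inlined version of the time loop in main.
   run_hand_inlined and tsteps are illustrative names only. */
void run_hand_inlined(float *__restrict A, const float *__restrict B,
                      const float *__restrict C, int tsteps) {
    for (int t = 0; t < tsteps; ++t) {
        /* Body of kernel in its original form: at the source level,
           both loads stay inside the loop on every iteration. */
        float tmp0 = B[42];
        float tmp1 = C[42];
        A[0] += tmp0 * tmp1;
    }
}
```

This only shows the source-level shape. An optimizing compiler is still free to hoist the loads out of the loop when it compiles this function, which is exactly the behavior I am trying to avoid.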