Understanding GCC inlining

Question

I have been trying to refactor some low-level code of moderate size, and I can't say I am too happy with the way compiler optimizers are inlining code.

I don't really understand how gcc inlines code, but for my one particular case, I am getting run-time speed equivalent to hand-written code in gcc 8.2.1 by using these options:

-std=c++17 -Winline 
-Ofast -march=native -DNDEBUG 
-finline-limit=100000 --param large-function-insns=10000 --param large-stack-frame-growth=1000 
--param inline-unit-growth=1000 --param early-inlining-insns=150 --param max-early-inliner-iterations=1000
-fopenmp -fPIC

Without the inlining options, my program is 3 times slower. I would have expected a simpler option to tell the compiler "trust me, when I say inline, you MUST inline it". Is there such a compiler option?

Notes:

Some details about the code: there are 3 nested for loops, the third one being SIMD, each iteration is computing complex fixed-size linear algebra stuff. The linear algebra stuff itself is not SIMD (since the above loop is). Most of the abstraction deals with multi-dimensional arrays and dense linear algebra (which needs expression templates).
All the functions that I want inlined are defined in the compilation unit. I have no recursive or virtual functions, and I don't throw exceptions. My functions are constexpr rather than inline, but constexpr implies inline. There are no 3rd-party library calls (all calls are math functions such as std::sqrt). There is no parallelism except SIMD.
At the beginning of the refactoring, when there are still few functions to inline, there is absolutely no problem. But as I add more and more inline functions to abstract the code, the compiler begins to struggle with inlining (and, it seems, with other things such as SROA).
I have a lot of tiny little functions to inline. I don't define functions for the sake of it, but I do need to define a lot of them in order to be generic.
I am working on a realistic test case for performance measuring. This is not a micro benchmark, so I am confident that I am measuring what I really want to measure.
If a function in my hot loop is not inlined, I do measure a x2 performance penalty (certainly due to the fact that it prevents a lot of further optimizations, in particular with vectorization and SROA)
I began to work on this with the intel compiler, which is excruciatingly slow and buggy with template code. Keeping hand-written performance was very complicated, so I switched to gcc.
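
To make the structure concrete, here is a minimal sketch of the kind of loop nest described in these notes. It is not the actual code: the names `dot3`, `norm3`, and `kernel`, the 3-component data layout, and the computation itself are all assumptions chosen for illustration. The point is only that the innermost `#pragma omp simd` loop calls small fixed-size helpers, and that the loop can only be vectorized well if those helpers are inlined into it.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical one-liner of the kind that must be inlined into the hot loop.
constexpr double dot3(const std::array<double, 3>& a,
                      const std::array<double, 3>& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Hypothetical small wrapper; std::sqrt is the only library call.
inline double norm3(const std::array<double, 3>& v) {
    return std::sqrt(dot3(v, v));
}

// Three nested loops; only the innermost one is marked SIMD.
// x, y, z and out each hold ni * nj * nk values (assumed layout).
void kernel(std::size_t ni, std::size_t nj, std::size_t nk,
            const double* x, const double* y, const double* z, double* out) {
    for (std::size_t i = 0; i < ni; ++i) {
        for (std::size_t j = 0; j < nj; ++j) {
            const std::size_t base = (i * nj + j) * nk;
            #pragma omp simd
            for (std::size_t k = 0; k < nk; ++k) {
                // Fixed-size linear algebra per iteration, not SIMD by itself.
                const std::array<double, 3> v{x[base + k], y[base + k], z[base + k]};
                out[base + k] = norm3(v);
            }
        }
    }
}
```

If `norm3` or `dot3` were left as an out-of-line call here, the SIMD loop would contain a function call per element, which is the situation the x2 penalty above refers to.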

Now I noticed some strange behavior:

In some cases, not using -fPIC made gcc issue a -Winline warning saying it did not inline. I don't understand the relation between -fPIC and inlining whatsoever.
I don't understand the need to specify early inlining passes for gcc. I would have thought that --param early-inlining-insns=150 should only be used to optimize compilation time, not the code generated by gcc. But the fact is that if the value is 50 I get a silent bad inlining (no warning by gcc), and if the value is 1000 I also get bad inlining (gcc warns me this time). What is going on?
I am a bit reluctant to use __attribute__((always_inline)) because it would be ugly to apply it to every little function, and it seems to me that even with this attribute, gcc sometimes does not inline the function. Does gcc really always inline functions with this attribute?

How do I force gcc to inline all my inline functions? I don't understand, even conceptually, why the compiler has such a hard time inlining when it seems so simple to do by hand. Are there scalability optimization problems with inlining?

Oh, so it means that you rely on auto-vectorization (in my experience it is worthless, so I rely on intrinsics instead)? Also, where did you get those numbers to tune those settings? — user7860670, Sep 23 '18 at 20:56
@Drew Dormann The "duplicate" question answers about 1/20 of mine... — Bérenger, Sep 23 '18 at 20:56
@VTT I would not call that auto vectorization. The pragma *requires* the compiler to vectorize the code to be openmp compliant. Anyway, it does vectorize. — Bérenger, Sep 23 '18 at 20:58
"why the compiler has such a hard time inlining": excessive inlining usually makes the program slower. Every compiler has heuristics when to inline/not to inline. `always_inline` should inline always. The only exception I could imagine when the code flow is complex: functions calling each other in a recursive manner. In this case, the compiler may not be able to transform recursion to a loop. — geza, Sep 23 '18 at 20:58
@VTT Those numbers were to be sure the code was below the limit — Bérenger, Sep 23 '18 at 21:00
@geza I agree, but you have to understand that in my case, the functions barely do anything. At least 90% of them are one-liners. Plus, my benchmarks clearly indicate I need them inlined — Bérenger, Sep 23 '18 at 21:01
@geza But is there no way to tell GCC "replace `inline` by `__attribute__((always_inline))`"? — Bérenger, Sep 23 '18 at 21:03
Maybe, after inlining some functions, your function becomes larger than the inline code-size limit. Try to apply `force_inline` to all your functions, it shouldn't be that hard. If GCC still doesn't inline, then put a reproducible sample here please. — geza, Sep 23 '18 at 21:04
Let me state the obvious... Benchmark results were *not* provided with and without `-Winline`. The best I can tell OP never states or provides evidence `-Winline` is the reason for the speedup. I tend to agree with Drew. Options like `-Ofast`, `-march=native` and `-fopenmp` are probably driving the improvements. — jww, Sep 23 '18 at 21:06
I don't think so. I use a macro for this, when I really want a function to be inlined: `#define FORCE_INLINE ...`. And I use this macro instead of `inline`. — geza, Sep 23 '18 at 21:07
@jww with only `-Ofast -march=native -DNDEBUG -fopenmp -fPIC`, x3 slower. I just did the test. — Bérenger, Sep 23 '18 at 21:09
@jww Thanks for pointing out the ambiguity, I edited the OP to be clearer — Bérenger, Sep 23 '18 at 21:11
@jww And just to be clear, -Winline just warns when a function declared inline is not actually inlined by the compiler — Bérenger, Sep 23 '18 at 21:12
@geza To be completely sure: so your macro is `#define FORCE_INLINE __attribute__((always_inline))`? — Bérenger, Sep 23 '18 at 21:16
`#define FORCE_INLINE inline __attribute__((always_inline))` (As far as I remember, the `inline` still has to be specified). This is for GCC. MSVC has `__forceinline`. — geza, Sep 23 '18 at 21:19
@geza OK thanks i will try it (it's a bit daunting but since nobody seems to know what gcc is really doing...) — Bérenger, Sep 23 '18 at 21:22
Even if they knew, I wouldn't use that information for your case. These switches are used to adjust the parameters of the heuristics. In your case, you don't need this. You don't want to adjust heuristics, you want to tell the compiler: "inline this, no matter what". This is what `always_inline` is for. — geza, Sep 23 '18 at 21:27
@geza Yes it makes sense. I thought I tried to force_inline and it didn't work so I fell back on these heuristics. But the reason it didn't work may have been because I wasn't exhaustive regarding other functions. I will try to force_inline *all* the small functions then come back here to give the result. Thanks — Bérenger, Sep 23 '18 at 21:32
Also see [GCC recommendations and options for fastest code](https://stackoverflow.com/q/3005564/608639) and [GCC optimization flags for matrix/vector operations](https://stackoverflow.com/q/16064288/608639) — jww, Sep 23 '18 at 21:35
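
For readers following the comment thread, here is a minimal sketch of the `FORCE_INLINE` macro approach suggested above. The GCC and MSVC spellings are the ones quoted in the comments; the Clang branch, the plain `inline` fallback, and the helper functions `square` and `hypot2` are assumptions added for illustration, not something the thread confirms.

```cpp
#include <cmath>

// Pick the compiler-specific "always inline" spelling.
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline  // assumed fallback: an ordinary inline hint
#endif

// Hypothetical one-liners, used in place of a plain `inline`/`constexpr` declaration.
FORCE_INLINE constexpr double square(double x) { return x * x; }

FORCE_INLINE double hypot2(double a, double b) {
    return std::sqrt(square(a) + square(b));
}
```

The idea, as discussed in the comments, is to mark every small helper in the hot path this way instead of tuning the inliner's heuristic parameters, since a single helper left to the heuristics can still end up as an out-of-line call.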
