I have been trying to refactor some low-level code of moderate size, and I can't say I am too happy with the way compiler optimizers are inlining code.
I don't really understand how gcc inlines code, but for my one particular case, I am getting run-time speed equivalent to hand-written code in gcc 8.2.1 by using these options:
-std=c++17 -Winline
-Ofast -march=native -DNDEBUG
-finline-limit=100000 --param large-function-insns=10000 --param large-stack-frame-growth=1000
--param inline-unit-growth=1000 --param early-inlining-insns=150 --param max-early-inliner-iterations=1000
-fopenmp -fPIC
Without the inline options, my program is 3 times slower. I would have expected a simpler option to tell the compiler "trust me, when I say inline, you MUST inline it". Is there such a compiler option?
Notes:
- Some details about the code: there are 3 nested for loops, the third one being SIMD, each iteration is computing complex fixed-size linear algebra stuff. The linear algebra stuff itself is not SIMD (since the above loop is). Most of the abstraction deals with multi-dimensional arrays and dense linear algebra (which needs expression templates).
- All the functions that I want inlined are defined in the compilation unit. I have neither recursive nor virtual functions, and I don't throw exceptions. My functions are constexpr rather than inline, but constexpr implies implicit inline. There are no 3rd-party library calls (all calls are math functions such as std::sqrt). There is no parallelism except SIMD.
- At the beginning of the refactoring, when there are still few functions to inline, there is absolutely no problem. But as I add more and more inline functions to abstract the code, the compiler begins to struggle with inlining (and, it seems, with other things such as SROA).
- I have a lot of tiny little functions to inline. I don't define functions for the sake of it, but I do need to define a lot of them in order to be generic.
- I am working on a realistic test case for performance measuring. This is not a micro benchmark, so I am confident that I am measuring what I really want to measure.
- If a function in my hot loop is not inlined, I do measure a x2 performance penalty (certainly due to the fact that it prevents a lot of further optimizations, in particular with vectorization and SROA)
- I began to work on this with the Intel compiler, which is excruciatingly slow and buggy with template code. Keeping hand-written performance was very complicated, so I switched to gcc.
Now I noticed some strange behavior:
- In some cases, not using -fPIC made gcc issue a -Winline warning saying it did not inline. I don't understand the relation between -fPIC and inlining whatsoever.
- I don't understand the need to specify early inlining passes for gcc. I would have thought that --param early-inlining-insns=150 should only be used to optimize compilation time, not the code generated by gcc. But the fact is that if the value is 50 I get a silent bad inlining (no warning by gcc), and if the value is 1000 I also get bad inlining (gcc warns me this time). What is going on?
- I am a bit reluctant to use __attribute__((always_inline)) because it would be ugly to do it for every little function, but it seems to me that even with this attribute, gcc sometimes does not inline the function. Does gcc really always inline functions with this attribute?
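For reference, this is roughly what applying the attribute to every small function would look like if it were hidden behind a macro to reduce the clutter. It is only a sketch: FORCE_INLINE, square and hypot2 are hypothetical names, and the fallback branch for other compilers is an assumption.

```cpp
// Hypothetical macro wrapping the GCC/Clang attribute; other compilers
// fall back to a plain inline hint.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// The attribute has to be repeated on every helper, which is the part
// that gets ugly with many tiny functions.
FORCE_INLINE constexpr double square(double x) { return x * x; }

FORCE_INLINE constexpr double hypot2(double x, double y) {
    return square(x) + square(y);
}
```

The macro keeps each declaration shorter, but it still has to be written on every function, so it does not replace a single compiler option that would do the same for all inline functions.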
How do I force gcc to inline all my inline functions? I don't understand, even conceptually, why the compiler has such a hard time inlining when it seems so simple to do by hand. Are there scalability optimization problems with inlining?