I have several small functions marked inline that take an array of __m256 intrinsic vectors as a parameter. When I compile with clang-15, the generated assembly is clean (I use -save-temps to inspect it). When I compile with gcc-12, performance is ~10x worse with -O3 and ~4x worse with -O2. After inspecting the assembly, I found that gcc does not inline the small functions and instead moves the data held in the __m256 vectors to the stack in order to pass it, which I assume is the main reason for the performance difference.

I've tried -finline-limit=100000, and while it helped a lot, performance is still worse than with clang, and -O2 still outperforms -O3.

I am also using constructs like std::tie(a,b) = foo(a,b); where a and b are __m256 vectors. When compiling with gcc 12.1, calls such as call _ZNSt11_Tuple_impl... are generated, while with clang-15 or gcc 11.3 they are not.
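
A minimal sketch of that std::tie pattern, for illustration only. The names `butterfly` and `butterfly_in_place` and their bodies are hypothetical stand-ins, not the actual code from the project. The sketch uses the GNU vector-extension operators `+` and `-` on `__m256` (supported by GCC and Clang), which is an assumption made so that it builds without `-mavx`; real code would more likely call `_mm256_add_ps` / `_mm256_sub_ps` and be compiled with `-mavx` or `-march=native`.

```cpp
#include <cstring>
#include <immintrin.h>
#include <tuple>

// Hypothetical helper: returns the element-wise sum and difference of a and b.
// The + and - operators on __m256 are GNU vector extensions, not intrinsics.
inline std::tuple<__m256, __m256> butterfly(const __m256& a, const __m256& b) {
    return std::make_tuple(a + b, a - b);
}

// Hypothetical caller: loads two 8-float buffers, applies butterfly via
// std::tie, and stores the results back into the same buffers.
inline void butterfly_in_place(float (&x)[8], float (&y)[8]) {
    __m256 a, b;
    std::memcpy(&a, x, sizeof a);
    std::memcpy(&b, y, sizeof b);
    std::tie(a, b) = butterfly(a, b);  // the construct in question
    std::memcpy(x, &a, sizeof a);
    std::memcpy(y, &b, sizeof b);
}
```

If everything is inlined, the tuple is expected to disappear and `a` and `b` can stay in vector registers; a remaining `call _ZNSt11_Tuple_impl...` in the assembly means the tuple constructor or assignment was left out of line.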

What is the correct way of controlling inlining with gcc?

deezo
  • Seems more like a compiler bug, or some other issue that disables inlining in such cases. Generally, marking a function inline doesn't actually enforce inlining. There is also the compiler-specific `__attribute__((always_inline))` attribute (GCC/Clang) for requesting inlining more forcefully; see https://stackoverflow.com/questions/2765164/inline-vs-inline-vs-inline-vs-forceinline – ALX23z Apr 13 '23 at 10:42
  • I don't think there is a one-size-fits-all solution that would work without examining what went wrong with your code. [mre]? If you decide to report this as a bug to gcc, you will be asked for a reproducer anyway. – teapot418 Apr 13 '23 at 10:47
  • I assume you mean an array like `__m256i foo[]` or something? An "array of registers" is a contradiction: registers don't have addresses in memory. So the elements of a `__m256i[]` can only actually stay in registers if the compiler optimizes away the array after inlining, like scalar replacement of the aggregate (array). Or to put it another way, `__m256i` is a C++ vector type that fits in a vector register, but it's not accurate to call it a register. Just like `int` is a type that fits in an integer register, but doesn't always live there. – Peter Cordes Apr 13 '23 at 10:47
  • I want to avoid using this attribute. I've tried adding it to the problematic function, but the resulting assembly still contained calls to the functions above/below the problematic function in the logical call stack, so I would need to add a lot of attributes. – deezo Apr 13 '23 at 10:53
  • @PeterCordes I pass std::array<__m256,N> as syntactic sugar instead of passing N registers. I assumed that after inlining it would just be equivalent to using registers directly. By "registers" I mean the intrinsic vector types like __m256 or __m256d. – deezo Apr 13 '23 at 10:56
  • When talking about how C++ compiles to asm and whether `__m256` gets spilled/reloaded or not, it's best to use precise terminology. A `__m256` isn't "a register", it's a vector, or "SIMD vector". It's a C++ object; whether and when it's in a register or not is up to the compiler, since you're not writing asm by hand. – Peter Cordes Apr 13 '23 at 11:00
  • Sorry for the incorrect terminology, I've edited my question to reflect your corrections. – deezo Apr 13 '23 at 11:01
  • There does not seem to be a problem for very simple functions: https://godbolt.org/z/6rjWjjor9. You need to provide a [mre]! – chtz Apr 13 '23 at 21:32
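
The attribute-based approach raised in the comments can be sketched as follows. This is a hedged illustration, not a confirmed fix for the regression described in the question, since no reproducer is available. All function names (`add_arrays`, `double_arrays`, `kernel`) are hypothetical. `__attribute__((always_inline))` forces inlining of one function; `__attribute__((flatten))`, a GCC attribute also accepted by Clang, is placed on the top-level caller and asks the compiler to inline every call inside it where possible, which avoids annotating each helper. As an assumption for portability, the sketch uses GNU vector-extension `+` on `__m256` so it builds without `-mavx`.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <immintrin.h>

// Hypothetical leaf helper, forced inline regardless of size heuristics.
template <std::size_t N>
__attribute__((always_inline)) inline void add_arrays(
    std::array<__m256, N>& acc, const std::array<__m256, N>& v) {
    for (std::size_t i = 0; i < N; ++i) {
        acc[i] = acc[i] + v[i];  // GNU vector extension, not an intrinsic
    }
}

// Hypothetical mid-level helper, deliberately left without any attribute.
template <std::size_t N>
inline void double_arrays(std::array<__m256, N>& v) {
    add_arrays(v, v);  // each element becomes twice its old value
}

// Hypothetical top-level function: flatten requests that all calls in this
// body, including those introduced by inlining, be inlined if possible.
__attribute__((flatten)) void kernel(float* data) {
    std::array<__m256, 2> v;
    std::memcpy(v.data(), data, sizeof v);  // load 16 floats
    double_arrays(v);
    double_arrays(v);                       // 4x the input overall
    std::memcpy(data, v.data(), sizeof v);  // store 16 floats
}
```

Whether the `std::array` is then fully replaced by registers still depends on the optimizer, so the generated assembly should be checked (for example with -save-temps) rather than assumed.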

0 Answers