I have multiple small functions marked inline that accept an array of intrinsic __m256 vectors as parameters.
When I compile with clang-15, the generated assembly is clean. (I use -save-temps to inspect the assembly.)
When I compile with gcc-12, performance is ~10x worse with -O3 and ~4x worse with -O2. After inspecting the assembly, I found that gcc does not inline the small functions and instead moves the data held in the __m256 vectors to the stack to pass it, which I assume is the main reason for the performance difference.
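A minimal sketch of the pattern described above, for illustration only. The names add_to_rows and sum_after_add, the array size of 4, and passing the array by reference are all assumptions, not the actual code. The target attribute is also an assumption, used here only so the sketch builds without -mavx.

```cpp
#include <immintrin.h>

// Assumption: AVX is enabled per function through the target attribute so
// that the sketch builds without extra flags; passing -mavx on the command
// line would make the attribute unnecessary.
#define AVX_FN __attribute__((target("avx")))

// Hypothetical small helper: adds `delta` to every vector in the array.
// Taking the array by reference is only one possible way to "accept an
// array" of __m256.
AVX_FN inline void add_to_rows(__m256 (&rows)[4], __m256 delta) {
    for (__m256& row : rows) {
        row = _mm256_add_ps(row, delta);
    }
}

// Hypothetical caller: fills four vectors, calls the helper, and sums all
// 32 lanes so the result can be checked as a scalar.
AVX_FN inline float sum_after_add(float start, float delta) {
    __m256 rows[4];
    for (__m256& row : rows) {
        row = _mm256_set1_ps(start);
    }
    add_to_rows(rows, _mm256_set1_ps(delta));

    float total = 0.0f;
    for (const __m256& row : rows) {
        float lanes[8];
        _mm256_storeu_ps(lanes, row);  // unaligned store, safe for any buffer
        for (float lane : lanes) {
            total += lane;
        }
    }
    return total;
}
```

In the assembly for sum_after_add, a call to add_to_rows together with stores of the vectors to the stack would indicate that the helper was not inlined.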
I've tried -finline-limit=100000; it helped a lot, but performance is still worse, and -O2 still outperforms -O3.
I am also using constructs like std::tie(a,b) = foo(a,b); where a and b are __m256 vectors. When compiling with gcc 12.1, calls such as call _ZNSt11_Tuple_impl... are generated, while with clang-15 or gcc 11.3 they are not.
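A minimal sketch of the std::tie construct, for illustration only. The body of foo, its std::tuple return type, and the wrapper tie_demo are assumptions; only the std::tie(a,b) = foo(a,b); line comes from the description above. As before, the target attribute is there only so the sketch builds without -mavx.

```cpp
#include <immintrin.h>
#include <tuple>

// Assumption: AVX is enabled per function so no extra flags are needed.
#define AVX_FN __attribute__((target("avx")))

// Hypothetical foo: returns (a + b, a - b) as a tuple of two vectors.
AVX_FN inline std::tuple<__m256, __m256> foo(__m256 a, __m256 b) {
    return std::make_tuple(_mm256_add_ps(a, b), _mm256_sub_ps(a, b));
}

// Hypothetical wrapper that uses the construct and returns a scalar
// so the result can be checked.
AVX_FN inline float tie_demo() {
    __m256 a = _mm256_set1_ps(5.0f);
    __m256 b = _mm256_set1_ps(3.0f);

    // The construct in question: both vectors are reassigned through a
    // tuple of references. Each lane of a becomes 8, each lane of b becomes 2.
    std::tie(a, b) = foo(a, b);

    float a_lanes[8];
    float b_lanes[8];
    _mm256_storeu_ps(a_lanes, a);
    _mm256_storeu_ps(b_lanes, b);
    return a_lanes[0] + b_lanes[0];
}
```

The tuple assignment operator is a library template, so whether it shows up as a separate call in the assembly depends on the compiler's inlining decisions for that template instantiation.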
What is the correct way of controlling inlining with gcc?