0

It has been a while since I started working with SSE/AVX intrinsic functions. I recently began writing a header for matrix transposition. I used a lot of if constexpr branches so that the compiler always selects the optimal instruction set depending on some template parameters. Now I wanted to check if everything works as expected by looking into the local disassembly with objdump. When using Clang, I get a clear output which basically contains only the assembly instructions corresponding to the utilized intrinsic functions. However, if I use GCC, the disassembly is quite bloated with extra instructions. A quick check on Godbolt shows me that those extra instructions in the GCC disassembly shouldn't be there.

Here is a small example:

#include <x86intrin.h>
#include <array>

std::array<__m256, 1> Test(std::array<__m256, 1> a)
{
    std::array<__m256, 1> b;

    b[0] = _mm256_unpacklo_ps(a[0], a[0]);
    return b;
}

I compile with -march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z. Then I use objdump -S -Mintel libassembly.a > libassembly.dump on the object file. For Clang (6.0.0), the result is:

In archive libassembly.a:

libAssembly.cpp.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>:
   0:   c4 e3 7d 04 c0 50       vpermilps ymm0,ymm0,0x50
   6:   c3                      ret    

which is the same as Godbolt returns: Godbolt - Clang 6.0.0

For GCC (7.4) the output is

In archive libassembly.a:

libAssembly.cpp.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>:
   0:   4c 8d 54 24 08          lea    r10,[rsp+0x8]
   5:   48 83 e4 e0             and    rsp,0xffffffffffffffe0
   9:   c5 fc 14 c0             vunpcklps ymm0,ymm0,ymm0
   d:   41 ff 72 f8             push   QWORD PTR [r10-0x8]
  11:   55                      push   rbp
  12:   48 89 e5                mov    rbp,rsp
  15:   41 52                   push   r10
  17:   48 83 ec 28             sub    rsp,0x28
  1b:   64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
  22:   00 00 
  24:   48 89 45 e8             mov    QWORD PTR [rbp-0x18],rax
  28:   31 c0                   xor    eax,eax
  2a:   48 8b 45 e8             mov    rax,QWORD PTR [rbp-0x18]
  2e:   64 48 33 04 25 28 00    xor    rax,QWORD PTR fs:0x28
  35:   00 00 
  37:   75 0c                   jne    45 <_Z4TestSt5arrayIDv8_fLm1EE+0x45>
  39:   48 83 c4 28             add    rsp,0x28
  3d:   41 5a                   pop    r10
  3f:   5d                      pop    rbp
  40:   49 8d 62 f8             lea    rsp,[r10-0x8]
  44:   c3                      ret    
  45:   c5 f8 77                vzeroupper 
  48:   e8 00 00 00 00          call   4d <_Z4TestSt5arrayIDv8_fLm1EE+0x4d>

As you can see, there are a lot of additional instructions. In contrast to that, Godbolt does not include all these extra instructions: Godbolt - GCC 7.4

So what is going on here? I have just started learning assembly, so maybe it is totally clear to someone with assembly experience, but I am a little bit confused why GCC creates those extra instructions on my machine.

Greetings and thank you in advance.

EDIT

To avoid further confusions, I just compiled using:

gcc-7 -I/usr/local/include -O3 -march=native -Wall -Wextra -Wpedantic -pthread -std=gnu++1z -o test.o -c /<PathToFolder>/libAssembly.cpp

Output remains the same. I am not sure if this is relevant, but it generates the warning: warning: ignoring attributes on template argument ‘__m256 {aka __vector(8) float}’ [-Wignored-attributes]

Usually I surpress this warning and it shouldn't be an issue:

Implication of GCC warning: ignoring attributes on template argument (-Wignored-attributes)

Processor is Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

Here is the gcc -v:

gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) 
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
wychmaster
  • 712
  • 2
  • 8
  • 23
  • I cannot reproduce it with my local GCC install. – Acorn Oct 05 '19 at 12:59
  • `why GCC` - wait, which gcc? You posted a link to godbolt to gcc 7.4 that generates `vunpcklps ymm0, ymm0, ymm0`. So what is the output you are presenting? Is it for your machine? You use `-march=native`, does your local machine support SSE/AVX ? – KamilCuk Oct 05 '19 at 13:05
  • In the "bad" example, are you sure that you compiled with optimization enabled? You mentioned `-O3` in the question, but that looks to me like unoptimized output. – Jason R Oct 05 '19 at 13:08
  • @JasonR: Pretty sure. I am using CMake and the verbose output tells me that the file is built with `-march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z -o CMakeFiles/assembly.dir/libAssembly.cpp.o` – wychmaster Oct 05 '19 at 13:24
  • @KamilCuk The output was generated on my machine with gcc 7.4 and my machine supports AVX2 instructions. – wychmaster Oct 05 '19 at 13:38
  • But if godbolt get's it right for what you have specified, and you present you don't, then the problem is somewhere else, in the code or environment you didn't specify. ` that the file is built with` - with what compiler? What is you cmake config? Why do you use cmake at all to check it? What is your gcc build with? `gcc -v`? `cat /proc/cpuinfo`? Let others reproduce the problem. – KamilCuk Oct 05 '19 at 13:42
  • @KamilCuk Have a look at the EDIT section of my original post. I added some extra information. I also compiled it directly without CMake and got the same result – wychmaster Oct 05 '19 at 14:00
  • Check the C++ standard library you have installed locally. I can't see the compiler's built-in include path in the `-v` output, but cmake is providing `-I/usr/local/include`. The extra assembly may be consistent with an older `std::array`, esp. if the compiler isn't sure how it is aligned... – Useless Oct 05 '19 at 14:04
  • @Useless I have installed gcc 8 and tested it, getting the same result. is there a way how I can explicitly select a STL? – wychmaster Oct 05 '19 at 14:52
  • Please don't edit an answer into the question. I updated my answer to highlight the relevant GCC option, in case that's why you felt the need to put a 2nd answer into the question. I also removed the question about gcc8 from your answer because my answer + comments already covered that. (I did make my answer state more explicitly and obviously that newer GCC doesn't fix this bug, though). In future, if you do feel like you need to add your own answer, post it as an answer. If you think an existing answer just needs a TL:DR at the top, suggest an edit or leave a comment. – Peter Cordes Oct 06 '19 at 14:19
  • @JasonR: you can tell the posted asm output wasn't `-O0` because it doesn't spill/reload `ymm0` to the stack. It does set up RBP as a frame pointer, though, another classic sign of optimization-disabled (but also of gcc's stack-alignment boilerplate). Anyway, getting the OP to post exact command-line options and GCC version was a good end result of all this doubt. – Peter Cordes Oct 06 '19 at 14:26

1 Answers1

8

Use -fno-stack-protector


Your local GCC defaults to -fstack-protector-strong but Godbolt's GCC install doesn't.

mov rax,QWORD PTR fs:0x28 is the telltale clue; Thread-local storage at fs:40 aka fs:0x28 is where GCC keeps its stack cookie constant. The call after the ret is call __stack_chk_fail (but you disassembled a .o without using objdump -dr to show relocations, so the placeholder +0 offset just looked like still a target within this function).

Since you have arrays (or a class containing an array), stack-protector-strong kicks in even though their sizes are compile-time constants. So you get the code to store the stack cookie, then check it and branch on stack overflow. (Even the array of size 1 in this MVCE is enough to trigger that.)

Making arrays on the stack with 32-byte alignment (for __m256) requires 32-byte alignment, and your GCC is older than GCC8 so you get the ridiculously clunky stack-alignment code that builds a full copy of the stack frame including a return address. Generated assembly for extended alignment of stack variables (To be clear, GCC8 still does align the stack here, just wasting fewer instructions on it.)

This is pretty much a missed optimization; gcc never actually spills or reloads to those arrays so it could have just optimized them away, along with the stack alignment, like it did without stack-protector.

More recent GCC is better at optimizing away stack alignment after optimizing away the memory for aligned locals in more cases, but this has been a persistent missed optimization in AVX code. Fortunately the cost is pretty negligible in a function that loops; as long as small helper functions inline.


Compiling on Godbolt with -fstack-protector-strong reproduces your output. Newer GCC, including current trunk pre-10, still has both missed optimizations, but stack alignment costs fewer instructions because it just uses RBP as a frame pointer and aligns RSP, then references locals relative to aligned RSP. It still checks the stack cookie (with no instructions between storing it and checking it).

On your desktop, compiling with -fno-stack-protector should make good asm.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thank you very much. Its awesome how much knowledge some people have around here :) . Compiling with `-fno-stack-protector` indeed solves the problem. However, I made a distribution upgrade to ubuntu 19.04 with GCC 8.3. The problem still remains when I compile without the flag. Shouldn't it disappear with GCC 8 +? --- The array sizes of 1 are actually an edge-case of the function template I am using where I thought that the compiler would optimize it away anyways. – wychmaster Oct 05 '19 at 22:12
  • @wychmaster: oops, you're right. I didn't look carefully and missed that stack alignment was still there even with GCC9. The asm output is shorter but that's because GCC8 has much simpler stack alignment in functions without VLAs / alloca, not because stack alignment is optimized away entirely. Silly me, fixed my answer. – Peter Cordes Oct 05 '19 at 22:36
  • @wychmaster: After inlining it only affects the caller once, not once per invocation of an inline function. Or if the caller needed stack alignment already, then there's no extra cost. But yes it *could* optimize away in the x86-64 System V calling convention. A `class` containing one `__m256[1]` member is passed and returned in vector registers, unlike in Windows x64 vectorcall where being inside a class would force pass/return by reference. https://godbolt.org/z/piojOa (Which would still optimize away when inlining, but the standalone version would have that guaranteed overhead.) – Peter Cordes Oct 05 '19 at 22:42
  • Thanks again for your explanation. I need the clean assembly only to check if my matrix transpose functions work as expected and to see if all optimizations I expect are actually applied. Apart from that, as long as there is no significant impact on my benchmarks, I can live with it. I am curious how big the effect of the MSVC overhead is. I think I have to finally start testing my code in Windows too. However, I can always use MinGW as an alternative to MSVC :p – wychmaster Oct 06 '19 at 13:38
  • @wychmaster: MinGW GCC (still AFAIK) isn't usable with AVX because of a bug where it doesn't align the stack before spilling `__m256` locals, or something like that. Use clang (it optimizes better than MSVC anyway, in general, and implements GNU C/C++ extensions). But anyway, both of them will still use the Windows x64 calling convention when targeting Windows. As long as your function inlines there's no actual pass/return overhead, though. – Peter Cordes Oct 06 '19 at 13:43
  • @wychmaster: BTW, Windows x64 vectorcall is actually fine here. https://godbolt.org/z/dz-A6m shows MSVC -Gv and (Linux) clang making efficient asm for Windows x64 vectorcall. Of course Windows x64 fastcall passed by reference, it would do that for bare `__m256`. (And even returns by reference because its dumb.) – Peter Cordes Oct 06 '19 at 13:49
  • Good to know and thanks again. As I said, as long as my benchmarks are not significantly slower in Windows, I won't complain :) . For now, as long as I can see, that a smart compiler actually performs all the optimizations that I expect, I am fine. If not, I 've probably messed something up in my C++ code that needs to be corrected. If for some reason a single compiler produces some weird asm on a specific system which doesn't affect performance, I guess it is okay. Just want to make sure that my code is optimizable in general and test it by looking into the asm. Greetings – wychmaster Oct 07 '19 at 10:18
  • Of course, the requirement is, that this specific compiler generally produces the weird asm and not just on my computer ;) . Otherwise, I have to find out, what I am doing wrong as here in this topic. Btw. many, many thx again for your explanations. – wychmaster Oct 07 '19 at 10:22
  • @wychmaster: Yeah, I get what you mean about wanting to make sure your C++ machinery can at least in theory optimize away (and that the resultant asm looks the way you want it to), and that in practice it's not bad for performance. That means you need to test it how it will actually compile. Several distros configure GCC with `-fstack-protector-strong` as the default; just your computer or even "just" every Ubuntu system! If you need `-fno-stack-protector` for it to compile well with such GCCs, document that and/or include it in your Makefile. – Peter Cordes Oct 07 '19 at 13:56