Why does GAS inline assembly wrapped in a function generate different instructions for the caller than a pure assembly function

Question

I've been writing some basic functions using GCC's asm to practice for an actual application.

My functions pretty, wrap, and pure generate the same instructions to unpack a 64 bit integer into a 128 bit vector. add1 and add2 which call pretty and wrap respectively also generate the same instructions. But add3 differs by saving its xmm0 register by pushing it to the stack rather than by copying it to another xmm register. This I don't understand because the compiler can see the details of pure to know none of the other xmm registers will be clobbered.

Here is the C++

#include <immintrin.h>

__m128i pretty(long long b) { return (__m128i){b,b}; }

__m128i wrap(long long b) {
    asm ("mov qword ptr [rsp-0x10], rdi\n"
         "vmovddup xmm0, qword ptr [rsp-0x10]\n"
         :
         : "r"(b)
         );
}

extern "C" __m128i pure(long long b);
asm (".text\n.global pure\n\t.type pure, @function\n"
     "pure:\n\t"
     "mov qword ptr [rsp-0x10], rdi\n\t"
     "vmovddup xmm0, qword ptr [rsp-0x10]\n\t"
     "ret\n\t"
     );

__m128i add1(__m128i in, long long in2) { return in + pretty(in2);}
__m128i add2(__m128i in, long long in2) { return in + wrap(in2);}
__m128i add3(__m128i in, long long in2) { return in + pure(in2);}

Compiled with g++ -c so.cpp -march=native -masm=intel -O3 -fno-inline and disassembled with objdump -d -M intel so.o | c++filt.

so.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <pure>:
   0:   48 89 7c 24 f0          mov    QWORD PTR [rsp-0x10],rdi
   5:   c5 fb 12 44 24 f0       vmovddup xmm0,QWORD PTR [rsp-0x10]
   b:   c3                      ret
   c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

0000000000000010 <pretty(long long)>:
  10:   48 89 7c 24 f0          mov    QWORD PTR [rsp-0x10],rdi
  15:   c5 fb 12 44 24 f0       vmovddup xmm0,QWORD PTR [rsp-0x10]
  1b:   c3                      ret
  1c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

0000000000000020 <wrap(long long)>:
  20:   48 89 7c 24 f0          mov    QWORD PTR [rsp-0x10],rdi
  25:   c5 fb 12 44 24 f0       vmovddup xmm0,QWORD PTR [rsp-0x10]
  2b:   c3                      ret
  2c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

0000000000000030 <add1(long long __vector(2), long long)>:
  30:   c5 f8 28 c8             vmovaps xmm1,xmm0
  34:   48 83 ec 08             sub    rsp,0x8
  38:   e8 00 00 00 00          call   3d <add1(long long __vector(2), long long)+0xd>
  3d:   48 83 c4 08             add    rsp,0x8
  41:   c5 f9 d4 c1             vpaddq xmm0,xmm0,xmm1
  45:   c3                      ret
  46:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  4d:   00 00 00

0000000000000050 <add2(long long __vector(2), long long)>:
  50:   c5 f8 28 c8             vmovaps xmm1,xmm0
  54:   48 83 ec 08             sub    rsp,0x8
  58:   e8 00 00 00 00          call   5d <add2(long long __vector(2), long long)+0xd>
  5d:   48 83 c4 08             add    rsp,0x8
  61:   c5 f9 d4 c1             vpaddq xmm0,xmm0,xmm1
  65:   c3                      ret
  66:   66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  6d:   00 00 00

0000000000000070 <add3(long long __vector(2), long long)>:
  70:   48 83 ec 18             sub    rsp,0x18
  74:   c5 f8 29 04 24          vmovaps XMMWORD PTR [rsp],xmm0
  79:   e8 00 00 00 00          call   7e <add3(long long __vector(2), long long)+0xe>
  7e:   c5 f9 d4 04 24          vpaddq xmm0,xmm0,XMMWORD PTR [rsp]
  83:   48 83 c4 18             add    rsp,0x18
  87:   c3                      ret

Why did you define pure as extern "C", and not the others? I suspect this is the reason for the difference, since you have forced the compiler to follow "C" calling convention. — Matt Jordan, Mar 30 '16 at 16:09
It was a recommendation from this site (https://www.cs.uaf.edu/2011/fall/cs301/lecture/10_12_asm_c.html) and then I didn't have to worry about the name mangling. — chew socks, Mar 30 '16 at 16:53
The normal way to write `pretty` that doesn't depend on how `immintrin.h` defines `__m128i` is `_mm_set1_epi64x(b)`. It compiles the same: gcc chooses store/`vmovddup` (worse latency, one fewer ALU uop), clang chooses `vmovq xmm0, rdi` / `vpbroadcastq xmm0, xmm0` (better latency, two port5 uops on Haswell) — Peter Cordes, Mar 30 '16 at 17:26

Timothy Baldwin · Answer 1 · 2016-03-30T17:31:35.987

2

GCC does not understand assembly language.

Since pure is an external function it cannot determine which registers it alters so according to the ABI has to assume all the xmm registers are changed.

wrap has undefined behaviour as the asm statement clobbers xmm0 and [rsp-0x10] which are not listed as clobbers or outputs (to a value which may or may not depend on b), and the function has no return statement.

Edit: The ABI does not apply to inline assembly, I expect your program will not work if you remove -fno-inline from the command line.

edited Mar 30 '16 at 17:31

answered Mar 30 '16 at 16:21

Timothy Baldwin

3,551
1
14
23

I was under the impression that the `xmm0` was OK but now that I think about it I see it; but I thought the `red zone` which I think `[rsp-0x10]` falls in has no guarantee about consistency across function calls. Wouldn't the lack of a return statement be covered by the calling convention which specifies `xmm0` as the return register; so that's where any caller of `wrap` will go to retrieve the value described by the function declaration? – chew socks Mar 30 '16 at 17:01
I added `xmm0` and `memory` to the clobber list (does that fix it?) and the only change was the addition of the `vzeroupper` instruction immediately after `vmovddup`. – chew socks Mar 30 '16 at 17:04
@chewsocks: you [must not clobber the red zone from inline asm](http://stackoverflow.com/a/34522750/224132), because there doesn't even seem to be a way to tell gcc that you want to do so. If you want a value in memory for `vmovddup`, write it like this: `asm ("vmovddup %[result], %[src]" : [result] "=x" (output) : [src] "m" (b) ); return output;`. Then gcc decides what memory to use, and can load directly from somewhere other than the stack. It doesn't force gcc to load/store/reload if the value was already in memory. [working example on godbolt](https://godbolt.org/g/n9uEaO). – Peter Cordes Mar 30 '16 at 17:36
Also, @chewsocks: requiring `-masm=intel` to build is not common practice. AT&T syntax isn't *that* bad. Unless this is an in-house project that other people don't need to build, it's usually a good idea to just suck it up and use the standard AT&T syntax for GNU inline asm. Or even use the dialect alternatives syntax: `"vmovddup {%[src], %[result] | %[src], %[result]}"`, IIRC. – Peter Cordes Mar 30 '16 at 17:43

Why does GAS inline assembly wrapped in a function generate different instructions for the caller than a pure assembly function

1 Answers1