You can't make your inline assembly code portable to Microsoft's C/C++ compiler for two reasons. The first is that the syntax for asm statements is too different: Microsoft's compiler expects something like __asm { mov rax, [rbp - 8] } instead of asm("movq -8(%rbp), %rax\n\t"). The second is that Microsoft's 64-bit compilers don't support inline assembly at all.
So you might as well do it right and use GCC's extended syntax. As it is, your inline assembly is horribly fragile. You can't depend on val being located at -8(%rbp). The compiler might not even put it on the stack. You also can't assume that the compiler won't mind you trashing RAX, XMM0 and XMM1.
So to do it right you need to tell the compiler which variables you want to use and which registers you're trashing. Plus you can let the compiler handle loading 1.0 into an XMM register. Something like this:
asm ("movq (%0), %%xmm1\n\t"
     "addsd %1, %%xmm1\n\t"
     "movsd %%xmm1, (%0)\n\t"
     : /* no output operands */
     : "r" (val), "x" (1.0)
     : "xmm1", "memory");
The "r" (val) input operand tells the compiler to put val into a general purpose register and then substitute that register's name for %0 wherever it appears in the string. Similarly, the "x" (1.0) operand tells the compiler to put 1.0 into an XMM register and substitute it for %1. The clobbers tell the compiler that the statement modifies the XMM1 register along with something in memory. You might also notice that I've swapped the operands on ADDSD so that only one register is modified by the statement.
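The same constraint machinery can also take over the load and the store. Here's a minimal sketch, assuming GCC or Clang targeting x86-64; the name foo_rw is made up for illustration, and I use the __asm__ spelling only so that it also compiles in strict ISO C modes. The "+x" (d) operand marks d as both read and written in an XMM register, so the statement needs no clobber list at all:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of foo (assumes GCC or Clang on x86-64).
   The compiler performs the load and the store itself; the asm
   statement only does the addition. */
void foo_rw(double *val) {
    assert(val != NULL);
    double d = *val;               /* compiler-generated load */
    __asm__ ("addsd %1, %0"        /* AT&T order: %0 = %0 + %1 */
             : "+x" (d)            /* read-write operand in an XMM register */
             : "x" (1.0));         /* 1.0 in another XMM register */
    *val = d;                      /* compiler-generated store */
}
```

Because the statement no longer touches memory or any register the compiler didn't pick itself, the optimizer is free to keep d wherever it likes.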
And here's the generated assembly when I compile it with the version of GCC I have installed on my computer:
foo:
pushq %rbp
movq %rsp, %rbp
movq %rcx, 16(%rbp)
movq 16(%rbp), %rax
movsd .LC2(%rip), %xmm0
/APP
movq (%rax), %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, (%rax)
/NO_APP
popq %rbp
ret
.LC2:
.long 0
.long 1072693248
Looks like my version of GCC decided to store val in 16(%rbp) instead of -8(%rbp). Your code wasn't even portable to other versions of GCC, let alone Microsoft's compiler. Let's look at what I get when I compile it with optimization turned on:
foo:
movsd .LC0(%rip), %xmm0
/APP
movq (%rcx), %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, (%rcx)
/NO_APP
ret
Look how short and sweet that function is. The compiler has eliminated all that unnecessary boilerplate code that sets up the stack frame. Also, since val is passed to the function in RCX, the compiler just uses that register in the inline assembly directly. No need to store it on the stack only to immediately load it back into another register.
Of course, just like your own code, none of this is remotely compatible with Microsoft's compiler. The only way to make it compatible is not to use inline assembly at all. Fortunately that's an option, and I don't just mean using *val + 1.0. To do this you need to use Intel's intrinsics, which are supported by GCC and Microsoft C/C++, along with Clang and Intel's own compiler. Here's an example:
#include <emmintrin.h>

void foo(double *val) {
    __m128d a = _mm_load_sd(val);
    const double c = 1.0;
    __m128d b = _mm_load_sd(&c);
    a = _mm_add_sd(a, b);
    _mm_store_sd(val, a);
}
My compiler does something hideous with this when compiling without optimization, but here's what it looks like with optimization:
foo:
movsd (%rcx), %xmm0
addsd .LC0(%rip), %xmm0
movlpd %xmm0, (%rcx)
ret
The compiler is smart enough to know that it can use the 1.0 constant stored in memory directly in the ADDSD instruction.
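Since the constant ends up as a memory operand anyway, the named variable c isn't strictly needed. As a sketch (foo_set is a hypothetical name, not part of the code above), _mm_set_sd from the same header builds the 1.0 operand directly. I'd expect an optimizing compiler to generate much the same code for it, but I haven't checked that on every compiler:

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>

/* Hypothetical variant of foo that builds the constant with
   _mm_set_sd instead of loading it from a named variable. */
void foo_set(double *val) {
    assert(val != NULL);
    __m128d a = _mm_load_sd(val);          /* low lane = *val, high lane = 0 */
    a = _mm_add_sd(a, _mm_set_sd(1.0));    /* add 1.0 to the low lane only */
    _mm_store_sd(val, a);                  /* write the low lane back */
}
```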