
I'm looking at increasing the runtime performance of a C++ library that I have written and profiled. I'm very new to assembly (and inline assembly) and have a very basic question to ask.

How do I set the value of a vector register (xmm, ymm, zmm, etc.) to a constant float or double value using inline assembly? To keep the code portable to MSVC, I strongly prefer not to use GCC's extended assembly. When compiling with -S, I see that GCC puts the constant in a .data section; however, I don't think I can use that in inline code.

For simplicity, let's say I want to implement the foo function in the following C code:

#include <cstdio>

void foo(double *val);
int main(int argc, char **argv) {
   double val = 0.0;

   foo(&val);
   printf("val: %lf\n", val);
   return 0;
}

void foo(double *val) {
   // return *val + 1.0.
   __asm__ (
      "movq -8(%rbp), %rax\n\t"   // move pointer from stack to rax.
      "movq (%rax), %xmm1\n\t"    // dereference pointer and move to xmm1.
      "?????????????"             // somehow move 1.0 to xmm0.
      "addsd %xmm1, %xmm0\n\t"    // add xmm1 to xmm0.
      "movsd %xmm0, (%rax)\n\t"   // move result back to val.
   );
 }

I have tried using push $0x3ff0000000000000 and pushq $0x3ff0000000000000 to move the value to the stack and then potentially move it to xmm0, with the following results:

"pushq $0x3ff0000000000000\n\t" = "Error: operand type mismatch for `push'".

"push $0x3ff00000\n\t" = Segmentation fault at this instruction.

Any help would be appreciated, and thanks in advance for your time.

Helios
  • Answered here: http://stackoverflow.com/questions/6514537/how-do-i-specify-immediate-floating-point-numbers-with-inline-assembly – oakad Jun 02 '15 at 02:49
  • I saw that post, but it still uses the ``.data`` section when it declares ``const1: dq 1.2345``, which I can't access in inline assembly. Unless I'm misunderstanding something (which is likely). Thanks for a response though. – Helios Jun 02 '15 at 02:54
  • Try to read past the first reply: http://stackoverflow.com/a/6514824/2702398 – oakad Jun 02 '15 at 03:03
  • Are you referring to the ``push`` suggestion? or the one that uses extended assembly? - I'm trying to avoid extended assembly, and ``push`` causes a segfault. – Helios Jun 02 '15 at 03:08
  • push - that's how it is done by the C compiler anyway. – oakad Jun 02 '15 at 03:08

2 Answers


You can't make your inline assembly code portable to Microsoft's C/C++ compiler for two reasons. The first is that the syntax for asm statements is too different. Microsoft's compiler expects something like __asm { mov rax, [rbp - 8] } instead of asm("movq -8(%rbp), %rax\n\t"). The second is that Microsoft's 64-bit compilers don't support inline assembly at all.

So you might as well do it right and use GCC's extended syntax. As it is, your inline assembly is horribly fragile. You can't depend on val being located at -8(%rbp). The compiler might not even put it on the stack. You also can't assume that the compiler won't mind you trashing RAX, XMM0 and XMM1.

So to do it right you need to tell the compiler which variables you want to use and which registers you're trashing. Plus you can let the compiler handle loading 1.0 into an XMM register. Something like this:

asm ("movq (%0), %%xmm1\n\t"
     "addsd %1, %%xmm1\n\t"
     "movsd %%xmm1, (%0)\n\t"
     : /* no output operands */
     : "r" (val), "x" (1.0)
     : "xmm1", "memory");

The "r" (val) input operand tells the compiler to put val into a general purpose register and then substitute that register name into %0 wherever it appears in the string. Similarly, the "x" (1.0) tells the compiler to put 1.0 into an XMM register, substituting it for %1. The clobbers tell the compiler that the XMM1 register is modified by the statement, along with something in memory. You might also notice that I've swapped the operands on ADDSD so that only one register is modified by the statement.
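
As a variation on the constraint mechanics described above, here is a minimal sketch that passes the value itself as a read-write "+x" operand, so the compiler handles the load and store as well as the constant. The names add_one and check_add_one are hypothetical, and the plain C++ fallback is an assumption added only so the sketch still builds on other compilers or architectures.

```cpp
#include <cassert>

// Hypothetical helper: returns x + 1.0.
// "+x" makes x a read-write operand held in an SSE register; "x"(1.0) asks
// the compiler to materialize the constant in another SSE register.
double add_one(double x) {
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("addsd %1, %0"   // %0 = %0 + %1
            : "+x"(x)        // output and input: x
            : "x"(1.0));     // input: the constant 1.0
    return x;
#else
    return x + 1.0;          // fallback for other compilers/targets (assumption)
#endif
}

// Usage sketch: the result should match ordinary C++ addition.
void check_add_one() {
    assert(add_one(0.0) == 1.0);
}
```

Because every register the statement touches is an operand, no clobber list is needed here, which is one reason this form tends to be less fragile than hard-coding register names.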

And here's the generated assembly when I compile it with the version of GCC I have installed on my computer:

foo:
    pushq   %rbp
    movq    %rsp, %rbp
    movq    %rcx, 16(%rbp)
    movq    16(%rbp), %rax
    movsd   .LC2(%rip), %xmm0

/APP
    movq (%rax), %xmm1
    addsd %xmm0, %xmm1
    movsd %xmm1, (%rax)
/NO_APP

    popq    %rbp
    ret

.LC2:
    .long   0
    .long   1072693248
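
The two .long values after .LC2 are the low and high 32-bit halves of the 64-bit encoding of 1.0 (1072693248 is 0x3FF00000). A small sketch that reassembles them, assuming an IEEE-754 64-bit double; the function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Rebuilds a double from the two 32-bit words of a constant-pool entry,
// low word first, as they appear in the listing.
double double_from_words(std::uint32_t low, std::uint32_t high) {
    std::uint64_t bits = (static_cast<std::uint64_t>(high) << 32) | low;
    double value;
    static_assert(sizeof value == sizeof bits, "expects a 64-bit double");
    std::memcpy(&value, &bits, sizeof value);  // reinterpret the bit pattern
    return value;
}

// Usage sketch: the words from .LC2 decode to 1.0.
void check_lc2() {
    assert(double_from_words(0u, 1072693248u) == 1.0);
}
```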

Looks like my version of GCC decided to store val in 16(%rbp) instead of -8(%rbp). Your code wasn't even portable to other versions of GCC, let alone Microsoft's compiler. Let's look at what I get when I compile it with optimization turned on:

foo:
    movsd   .LC0(%rip), %xmm0

/APP
    movq (%rcx), %xmm1
    addsd %xmm0, %xmm1
    movsd %xmm1, (%rcx)
/NO_APP

    ret

Look how short and sweet that function is. The compiler has eliminated all the unnecessary boilerplate code that sets up the stack frame. Also, since val is passed to the function in RCX, the compiler just uses that register in the inline assembly directly. No need to store it on the stack only to immediately load it back into another register.

Of course, just like your own code, none of this is remotely compatible with Microsoft's compiler. The only way to make it compatible is not to use inline assembly at all. Fortunately that's an option, and I don't just mean using *val + 1.0. To do this you need to use Intel's intrinsics, which are supported by GCC, Microsoft C/C++, Clang and Intel's own compiler. Here's an example:

#include <emmintrin.h>

void foo(double *val) {
    __m128d a = _mm_load_sd(val);
    const double c = 1.0;
    __m128d b = _mm_load_sd(&c);
    a = _mm_add_sd(a, b);
    _mm_store_sd(val, a);
}

My compiler does something hideous with this when compiling without optimization, but here's what it looks like with optimization:

foo:
    movsd   (%rcx), %xmm0
    addsd   .LC0(%rip), %xmm0
    movlpd  %xmm0, (%rcx)
    ret

The compiler is smart enough to know that it can use the 1.0 constant stored in memory directly in the ADDSD instruction.
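
A slightly shorter variant of the intrinsics version is sketched below. It uses _mm_set_sd to build the constant directly instead of loading it through a named variable. The name foo_set_sd is hypothetical, and the non-SSE2 fallback branch is an assumption added so the sketch compiles on other targets.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical variant of foo: adds 1.0 to *val.
void foo_set_sd(double *val) {
#ifdef SKETCH_HAVE_SSE2
    __m128d a = _mm_load_sd(val);          // low lane = *val, high lane = 0
    __m128d b = _mm_set_sd(1.0);           // low lane = 1.0, high lane = 0
    _mm_store_sd(val, _mm_add_sd(a, b));   // store the low lane of a + b
#else
    *val += 1.0;                           // fallback without SSE2 (assumption)
#endif
}

// Usage sketch.
void check_foo_set_sd() {
    double v = 0.0;
    foo_set_sd(&v);
    assert(v == 1.0);
}
```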

Ross Ridge
  • The assembly also includes this: ``.LC0: .long 0 .long 1072693248 .text``. You mentioned that it is using a preset value stored in the ADDSD instruction, and this makes sense for 1.0. Can you be more detailed about the mechanics of the ``.LC0(%rip)`` addressing though? I saw my compiler do this before and was thoroughly confused. Thanks again for taking the time to post, I appreciate it! – Helios Jun 02 '15 at 03:47
  • I left out the constant definition from the second and third assembly output examples to save space. The `.LC0(%rip)` means using RIP relative addressing to access the value stored at `.LC0`. It's the same as just using `.LC0` in this context, except that it makes the code position independent. Instead of using the address of `.LC0` the assembler uses the distance of `.LC0` from the instruction or `.LC0 - .` in other words, where `.` is the address of the instruction. Since `.` is also the value of RIP when the instruction is executed you get `.LC0 - . + %rip = .LC0`. – Ross Ridge Jun 02 '15 at 03:59

If anyone is interested in the exact answer to my question, I'm also posting it here since I somehow managed to figure it out with sheer luck and trial/error. The whole point of this was to learn simple assembly.

void foo(double *in) {
   __asm__ (
      "movq -8(%rbp), %rax\n\t"
      "movq (%rax), %xmm1\n\t"
      "movq $0x3FF0000000000000, %rbx\n\t" 
      "movq %rbx, %xmm0\n\t"
      "addsd %xmm1, %xmm0\n\t"
      "movsd %xmm0, (%rax)\n\t"
   );
}
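
Note that the snippet above still depends on val living at -8(%rbp) and overwrites RAX, RBX (a callee-saved register), XMM0 and XMM1 without telling the compiler. The same idea, moving the bit pattern of 1.0 from a general purpose register into an XMM register, can be sketched with intrinsics instead. The name foo_from_bits is hypothetical, and the fallback branch is an assumption for targets other than x86-64.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_X64 1
#endif

// Hypothetical variant of foo: adds 1.0 to *in.
void foo_from_bits(double *in) {
#ifdef SKETCH_X64
    // 0x3FF0000000000000 is the IEEE-754 bit pattern of 1.0.
    __m128i bits = _mm_cvtsi64_si128(0x3FF0000000000000LL); // integer -> XMM
    __m128d one  = _mm_castsi128_pd(bits);                   // reinterpret as double
    __m128d a    = _mm_load_sd(in);                          // low lane = *in
    _mm_store_sd(in, _mm_add_sd(a, one));                    // *in = *in + 1.0
#else
    *in += 1.0;                                              // fallback (assumption)
#endif
}

// Usage sketch.
void check_foo_from_bits() {
    double v = 0.0;
    foo_from_bits(&v);
    assert(v == 1.0);
}
```
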
Helios