Running In-Line Assembly in Linux Environment (Using GCC/G++)

Question

I have a very basic program written in C (a .c file) that contains a block of inline assembly. I know how to convert the .c file into assembly output, but I don't know how to compile that code in a Linux environment.

When using gcc, or g++ for .cpp files, I get errors about unrecognized asm instructions.

This code works as intended in Visual Studio. For GCC I changed the brackets around the asm code to parentheses, but I still get errors: a bunch of undefined references to the variables.

The changes I made to the working code were replacing the brackets with parentheses and putting the assembly instructions in quotation marks (found online, could be wrong).

In short, I want the code below to compile successfully in a Linux environment using gcc. I don't know the GCC syntax; the code works, just not on Linux.

#include <stdio.h>
int main()

{

float num1, num2, sum, product;
float sum, product;
float f1, f2, f3, fsum, fmul;

printf("Enter two floating point numbers: \n");
scanf("%f %f", &num1, &num2);


__asm__ 
(
    "FLD num1;"
    "FADD num2;"
    "FST fsum;"
);

printf("The sum of %f and %f " "is" " %f\n", num1, num2, fsum);
printf("The hex equivalent of the numbers and sum is %x + %x = %x\n", num1, num2, fsum);

return 0;
}

score 2 · Answer 1 · edited May 23 '17 at 12:07

GNU C inline asm is designed to not need data-movement instructions at the start/end of an asm statement. Any time you write a mov or fld or something as the first instruction in inline asm, you're defeating the purpose of the constraint system. You should have just asked the compiler to put the data where you wanted it in the first place.
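
As a rough illustration of that point with integer registers (a minimal sketch, not from the original question: the helper name add_u32 is made up, and the plain-C fallback is only there so the snippet still compiles on non-x86 targets), a "+r" read-write operand lets the compiler pick the registers, so the asm body is a single instruction with no mov around it:

```c
#include <stdint.h>

/* Hypothetical helper: a += b with one instruction inside the asm statement.
 * The constraints tell the compiler where the data must live; it emits any
 * moves that are needed outside the asm, and often none at all. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("addl %[src], %[dst]"   /* AT&T order: source first, destination second */
             : [dst] "+r" (a)        /* read-write operand in any general register */
             : [src] "r" (b)         /* input in any general register */
             : "cc");                /* add modifies the flags */
    return a;
#else
    return a + b;                    /* assumption: portable fallback elsewhere */
#endif
}
```

Calling add_u32(2, 3) should return 5, and the addition wraps modulo 2^32 like ordinary unsigned arithmetic.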

Also, writing new x87 code in 2016 is usually a waste of time. It's weird and a lot different from the normal way of doing FP math (scalar or vector instructions in xmm registers). You'll probably get better results by translating ancient asm code into pure C, if it was hand-tuned for very different microarchitectures, or doesn't take advantage of SSE instructions. If you do still want to write x87 code, then see the guide in the x86 tag wiki.

If you're trying to learn asm using GNU C inline asm, just don't. Pick any other way to learn asm, e.g. writing whole functions and calling them from C. See also the bottom of that answer for a collection of links to writing good GNU C inline asm.

There are special rules for x87 floating-point operands, because the x87 register stack isn't random-access. This makes inline-asm even more difficult to use than it already is for "normal" stuff. It also appears more difficult than normal to get optimal code.

In our case, we know we'll need one input operand on the top of the FP stack, and produce our result there. Asking the compiler to do this for us means we don't need any instructions beyond the fadd.

  asm (
    "fadd %[num2]\n\t"
    : "=t" (fsum)                                  // t is the top of the register stack
    : [num1] "%0" (num1), [num2] "f" (num2)         // 0 means same reg as arg 0, and the % means they're commutative.  gcc doesn't allow an input and output to both use "t" for some reason.  For integer regs, naming the same reg for an input and an output works, instead of using "0".
    : // "st(1)"  // we *don't* pop the num2 input, unlike the FYL2XP1 example in the gcc manual
      // This is optimal for this context, but in other cases faddp would be better
      // we don't need an early-clobber "=&t" to prevent num2 from sharing a reg with the output, because we already have a "0" constraint
  );

See the docs for constraint modifiers for an explanation of the %0.

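Before the fadd: num1 is in %st(0), because the "0" constraint ties it to the "=t" output. num2 is in another FP stack register, because of the "f" constraint. (The % lets the compiler swap the two inputs if that works out better.) The compiler chooses the register and fills in its name for %[num2].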

This should hopefully get the compiler to pop the stack the correct number of times afterward. (Note that the fst %0 in the other answer was pretty silly, since the output constraint had to be an FP stack register. It was likely to end up being a no-op like fst %st(0) or something.)

I don't see an easy way to optimize this to use faddp if both FP values are already in %st registers. e.g. faddp %st1 would be ideal if num1 was in %st1 before, but wasn't still needed in an FP register.
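
For reference, a popping variant can be sketched along the lines of the fyl2xp1 example in the GCC manual: one input is tied to the "=t" output, the other is pinned to %st(1) with "u", and an "st(1)" clobber tells the compiler that the asm pops it. The function name fadd_pop is hypothetical, and the plain-C fallback is an assumption for non-x86 targets. This only shows the syntax; it doesn't let the compiler choose between fadd and faddp depending on where the values already are.

```c
/* Hypothetical helper: x87 add that pops one input off the register stack. */
static inline float fadd_pop(float a, float b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    float sum;
    __asm__ ("faddp %%st, %%st(1)"   /* st(1) += st(0), then pop: result ends up in st(0) */
             : "=t" (sum)            /* output on top of the x87 stack */
             : "0" (a),              /* a starts in st(0), same register as the output */
               "u" (b)               /* b starts in st(1) */
             : "st(1)");             /* the asm pops this input, so declare it clobbered */
    return sum;
#else
    return a + b;                    /* assumption: portable fallback elsewhere */
#endif
}
```

fadd_pop(1.5f, 2.25f) should give 3.75f, since both inputs and the sum are exactly representable in single precision.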

Here's a complete version that actually compiles, and works even in 64bit mode since I wrote a type-punning wrapper function for you. This is needed for any ABI that passes some FP args in FP registers to varargs functions.

#include <stdio.h>
#include <stdint.h>

uint32_t pun(float x) {
  union fp_pun {
    float single;
    uint32_t u32;
  } xu = {x};
  return xu.u32;
}


int main()
{
  float num1, num2, fsum;

  printf("Enter two floating point numbers: \n");
  scanf("%f %f", &num1, &num2);

  asm (
    "fadd %[num2]\n\t"
    : "=t" (fsum)
    : [num1] "%0" (num1), [num2] "f" (num2)  // 0 means same reg as arg 0, and the % means it's commutative with the next operand.  gcc doesn't allow an input and output to both use "t" for some reason.  For integer regs, naming the same reg for an input and an output works, instead of using "0".
    : // "st(1)"  // we *don't* pop the num2 input, unlike the FYL2XP1 example in the gcc manual
      // This is optimal for this context, but in other cases faddp would be better
      // we don't need an early-clobber "=&t" to prevent num2 from sharing a reg with the output, because we already have a "0" constraint
  );

  printf("The sum of %f and %f is %f\n", num1, num2, fsum);
  // Use a union for type-punning.  The %a hex-formatted-float only works for double, not single
  printf("The hex equivalent of the numbers and sum is %#x + %#x = %#x\n",
         pun(num1), pun(num2), pun(fsum));

  return 0;
}

See how it compiles on the Godbolt Compiler Explorer.

Take out the -m32 to see just how silly it is to get the data into x87 registers just for one add, in normal code that uses SSE for FP math. (esp. since they also have to be converted to double-precision for printf after scanf gives us single-precision.)
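
As an aside on the pun() wrapper in the program above: a float passed to a varargs function like printf is promoted to double, so %x needs a real integer argument. A memcpy-based alternative to the union is sketched below. The name float_bits is hypothetical, and the sketch assumes float is a 32-bit IEEE-754 value, which holds on x86. Compilers typically turn the fixed-size memcpy into a plain register move.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float and uint32_t have the same size (true on x86). */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: return the raw bit pattern of a float. */
static inline uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* copy the object representation, no conversion */
    return u;
}
```

With this, printf("%#x\n", float_bits(1.0f)) passes an integer, so it behaves the same way regardless of how the ABI passes FP varargs.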

gcc ends up making some pretty inefficient-looking x87 code for 32bit as well. It ends up having both args in regs, since it loaded them from single-precision in preparation for storing them as double. For some reason it duplicates the value on the FP stack instead of storing as double before doing the fadd.

So in this case an "f" constraint makes better code than an "m" constraint, and I don't see an easy way with AT&T syntax to specify single-precision operand-size for a memory operand without breaking the asm for register operands. (fadds %st(1) doesn't assemble, but fadd (mem) doesn't assemble either with clang. GNU as defaults to single-precision memory operands, apparently.) With Intel syntax, the operand-size modifier is attached to the memory operand, and will be there if the compiler chooses a memory operand, otherwise not.
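
If you do want to force a memory operand, a sketch could look like the following. With only an "m" input, the AT&T size suffix is unambiguous, so fadds is fine. The function name fadd_mem and the non-x86 fallback are assumptions for illustration. For this particular program the register version still produced better code.

```c
/* Hypothetical helper: x87 add with the second input forced into memory. */
static inline float fadd_mem(float a, float b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    float sum;
    __asm__ ("fadds %[b]"            /* st(0) += the 32-bit float stored at [b] */
             : "=t" (sum)            /* result on top of the x87 stack */
             : "0" (a),              /* a is loaded into st(0) by the compiler */
               [b] "m" (b));         /* b must be a memory operand */
    return sum;
#else
    return a + b;                    /* assumption: portable fallback elsewhere */
#endif
}
```

fadd_mem(1.5f, 2.25f) should give 3.75f. Nothing is popped inside the asm here, so no "st(1)" clobber is needed.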

Anyway, this sequence would be better than what gcc emits, because it avoids the fld %st(1):

    call    __isoc99_scanf
    flds    -16(%ebp)
    subl    $12, %esp      # make even more space for args for printf beyond what was left after scanf
    fstl    (%esp)         # (double)num1

    flds    -12(%ebp)
    fstl    8(%esp)        # (double)num2

    faddp  %st(1)          # pops both inputs, leaving only fsum in %st(0)
    fsts    -28(%ebp)      # store the single-precision copy
    fstpl   16(%esp)       # (double)fsum
    pushl   $.LC3
    call    printf

But gcc doesn't think of doing it this way, apparently. Writing the inline asm to use faddp makes gcc do extra fld %st(1) before the faddp instead of convincing it to store the double args for printf before doing the add.

Even better would be if the single-precision stores were set up so they could be args for the type-pun printf, instead of having to be copied again for that. If writing the function by hand, I'd have scanf store results into locations that work as args for printf.

I might have gone with something like: https://godbolt.org/g/UrRcSg . If you use an output operand taking a single register constraint like `=t` or `=u` and there is at least one `f` input operand constraint the rule of thumb is to declare all the floating point (register) output operands as early clobber. See https://gcc.gnu.org/onlinedocs/gcc-4.2.0/gcc/Extended-Asm.html in the section _i386 floating point asm operands_ — Michael Petch, Apr 23 '16 at 07:48
@MichaelPetch: Thanks, I should have read up on x87 inline-asm rules. I just made some assumptions about how the FP stack was handled that in hindsight were too simplistic. One of the error messages I got along the way to arriving at my code recommended an early-clobber on the result, but I didn't understand why. I might update this answer tomorrow if I get back to it. — Peter Cordes, Apr 23 '16 at 08:01
One other observation. I notice you mention _CLANG_ in the comments. I don't believe that _CLANG_ will properly compile `fadds %[num1]\n\t` if a floating point register is passed (memory operand will be okay) — Michael Petch, Apr 23 '16 at 08:04
@MichaelPetch: You're right; gcc chokes on it too if you tell godbolt to actually make a binary. `gcc -S` doesn't even try to assemble, so asm syntax errors aren't detected by gcc. Intel syntax puts the operand-size qualifier into the memory-operand syntax if a memory operand is used, so it's actually an advantage here. Anyway, I think for this program, we might get better code from forcing a memory operand, since we know we're dealing with values in memory. (Normally that would be bad, though.) — Peter Cordes, Apr 23 '16 at 08:14
I think the reason why you are seeing it choose the FPU registers over the memory operands is because of the reusage they can get for the `printf` statements. If you remove the `printf` statements and mark the `__asm__` block `__volatile__` (just as an experiment) you should see it favor the memory operands and the code will be shortened : https://godbolt.org/g/fRajHV — Michael Petch, Apr 23 '16 at 08:35
@MichaelPetch: yup, I ended up taking out the memory constraint option for this case, because even if you force a memory constraint, gcc makes worse code with the printf still in. I did still use `"f"`, not `"u"`, because the `"0"` constraint on the other operand prevents `"f"` from choosing the reg which will be clobbered. I could have used `faddp` with an `st(1)` clobber to tell the compiler it popped that operand (and then I'd need a `"u"` constraint), but gcc makes worse code, not better, in that case :( Stupid compiler, it could store `printf`'s args before the add, instead of after. — Peter Cordes, Apr 23 '16 at 18:53
The only change I would have made to your last edit would be using "%0" instead of "0". It would allow the compiler to swap the two operands (`num1` and `num2`) around if one order yielded a better path. You can get away with it because add is commutative. — Michael Petch, Apr 23 '16 at 19:28
@MichaelPetch: Thanks, I'd forgotten what % did. I remembered reading at one point there was a way to tell the compiler about commutative operands, but I'd forgotten what it was, xD. — Peter Cordes, Apr 23 '16 at 22:23

score 0 · Accepted Answer · answered Apr 22 '16 at 02:44

In-line assembly in GCC is translated literally to the generated assembly source; since the variables don't exist in the assembly, there's no way what you have written can work.

The way to make it work is to use extended assembly, which annotates the assembly with modifiers that GCC will use to translate the assembly when the source is compiled.

__asm__
(
  "fld %1\n\t"
  "fadd %2\n\t"
  "fst %0"
  : "=f" (fsum)
  : "f" (num1), "f" (num2)
  :
);

answered Apr 22 '16 at 02:44

Ignacio Vazquez-Abrams


Ok that makes sense but when making that change, I get an error saying output constraint 0 must specify a single register in line 13 so I tried adding the following %eax\n but no luck. (Line 13 is the "fst %0") – user5894146 Apr 22 '16 at 02:54
Try replacing the `=f` with `=t`. I'm pretty much still garbage at this. – Ignacio Vazquez-Abrams Apr 22 '16 at 02:57
That worked perfectly after I added the '&' to make final being '"&=t"'. Thanks for your help, I really appreciate it, this was really a small portion of the bigger code I'm working on, I get the general idea so I should be able to do the rest. Once again, I appreciate the help since I am not familiar with extended assembly and linux itself. – user5894146 Apr 22 '16 at 03:00
Note that this is part of GCC itself, not Linux; the same principle applies to all platforms GCC runs on regardless of the OS. – Ignacio Vazquez-Abrams Apr 22 '16 at 03:04
Always try to avoid writing load and store instructions at the start/end of your inline asm. That's a clear sign you should be using constraints to get the compiler to put data where you want it. My version only has a single insn in the inline asm. e.g. using `"=t"` for the result means you don't need an `fst`. – Peter Cordes Apr 23 '16 at 07:09
