GNU C inline asm is designed to not need data-movement instructions at the start/end of an asm statement. Any time you write a mov or fld or something as the first instruction in inline asm, you're defeating the purpose of the constraint system. You should have just asked the compiler to put the data where you wanted it in the first place.
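As a minimal sketch of that idea with integer registers: a "+r" read-write operand lets the compiler pick the registers and get the values there, so the template holds only the one instruction that does real work. The add_in_place name is made up for illustration, the asm path assumes gcc or clang targeting x86, and it's spelled __asm__ so it also builds with strict -std=c99.

```c
#include <stdint.h>

// Hypothetical helper (not from the question): adds b to a using a
// one-instruction asm template, with no mov at the start or end.
uint32_t add_in_place(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("addl %[b], %[a]"
             : [a] "+r" (a)   // read-write: a is in a register of the compiler's choice
             : [b] "r" (b)    // input: also in a register the compiler picks
             : "cc");         // add writes the flags
#else
    a += b;                   // plain-C fallback so the sketch builds on other targets
#endif
    return a;
}
```

Calling add_in_place(2, 3) returns 5. The compiler is free to inline this and feed the asm whatever registers already hold the values.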
Also, writing new x87 code in 2016 is usually a waste of time. It's weird and a lot different from the normal way of doing FP math (scalar or vector instructions in xmm registers). You'll probably get better results by translating ancient asm code into pure C, if it was hand-tuned for very different microarchitectures, or doesn't take advantage of SSE instructions. If you do still want to write x87 code, then see the guide in the x86 tag wiki.
If you're trying to learn asm using GNU C inline asm, just don't. Pick any other way to learn asm, e.g. writing whole functions and calling them from C. See also the bottom of that answer for a collection of links to writing good GNU C inline asm.
There are special rules for x87 floating-point operands, because the x87 register stack isn't random-access. This makes inline-asm even more difficult to use than it already is for "normal" stuff. It also appears more difficult than normal to get optimal code.
In our case, we know we'll need one input operand on the top of the FP stack, and produce our result there. Asking the compiler to do this for us means we don't need any instructions beyond the fadd.
asm (
    "fadd %[num2]\n\t"
    : "=t" (fsum)    // t is the top of the register stack
    : [num1] "%0" (num1), [num2] "f" (num2)
      // 0 means same reg as arg 0, and the % means they're commutative.
      // gcc doesn't allow an input and output to both use "t" for some reason.
      // For integer regs, naming the same reg for an input and an output works, instead of using "0".
    : // "st(1)"     // we *don't* pop the num2 input, unlike the FYL2XP1 example in the gcc manual
    // This is optimal for this context, but in other cases faddp would be better
    // we don't need an early-clobber "=&t" to prevent num2 from sharing a reg with the output, because we already have a "0" constraint
);
See the docs for constraint modifiers for an explanation of the %0.
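Before the fadd: one input is in %st(0). That's num1 as written, because its "0" constraint ties it to the "=t" output, though the % lets the compiler swap the two inputs. The other input is in another FP stack register, since "f" only allows x87 registers. The compiler chooses which one, and fills in the register name.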
This should hopefully get the compiler to pop the stack the correct number of times afterward. (Note that fst %0 was pretty silly when the output constraint had to be an FP stack register. It was likely to end up being a no-op like fst %st(0) or something.)
I don't see an easy way to optimize this to use faddp if both FP values are already in %st registers. e.g. faddp %st(1) would be ideal if num1 was in %st(1) before, but wasn't still needed in an FP register.
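For comparison, here is roughly what a faddp version looks like if you follow the pattern of the FYL2XP1 example in the gcc manual: both inputs pinned to the top two stack slots with "t"/"u", and st(1) listed as a clobber because the instruction pops it. This is only a sketch. The x87_add_pop name is made up, it assumes gcc or clang targeting x86 with x87 enabled, and pinning both inputs may make the compiler shuffle the FP stack before the asm.

```c
// Hypothetical wrapper (not from the question): x87 add that pops one input.
float x87_add_pop(float a, float b)
{
    float sum;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("faddp %%st, %%st(1)"   // st(1) += st(0), then pop: the sum is the new st(0)
             : "=t" (sum)            // result on top of the x87 stack
             : "0" (a), "u" (b)      // a in st(0), b in st(1)
             : "st(1)");             // the asm pops one input, so st(1) is clobbered
#else
    sum = a + b;                     // plain-C fallback for non-x86 targets
#endif
    return sum;
}
```

Unlike the fadd version, this leaves nothing for the compiler to pop afterward, but it also gives the compiler no choice about where b lives.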
Here's a complete version that actually compiles, and works even in 64-bit mode since I wrote a type-punning wrapper function for you. This is needed for any ABI that passes some FP args in FP registers to varargs functions, because printf's %x conversion doesn't look for its value there.
#include <stdio.h>
#include <stdint.h>

uint32_t pun(float x) {
    union fp_pun {
        float single;
        uint32_t u32;
    } xu = {x};
    return xu.u32;
}

int main()
{
    float num1, num2, fsum;

    printf("Enter two floating point numbers: \n");
    scanf("%f %f", &num1, &num2);

    asm (
        "fadd %[num2]\n\t"
        : "=t" (fsum)
        : [num1] "%0" (num1), [num2] "f" (num2)
          // 0 means same reg as arg 0, and the % means it's commutative with the next operand.
          // gcc doesn't allow an input and output to both use "t" for some reason.
          // For integer regs, naming the same reg for an input and an output works, instead of using "0".
        : // "st(1)"     // we *don't* pop the num2 input, unlike the FYL2XP1 example in the gcc manual
        // This is optimal for this context, but in other cases faddp would be better
        // we don't need an early-clobber "=&t" to prevent num2 from sharing a reg with the output, because we already have a "0" constraint
    );

    printf("The sum of %f and %f is %f\n", num1, num2, fsum);
    // Use a union for type-punning. The %a hex-formatted-float only works for double, not single
    printf("The hex equivalent of the numbers and sum is %#x + %#x = %#x\n",
           pun(num1), pun(num2), pun(fsum));
    return 0;
}
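If you'd rather not use a union, memcpy is another common way to get at the bit pattern of a float. This is a sketch of an alternative, not what the code above uses: pun_memcpy is a made-up name, and it assumes float is a 32-bit IEEE 754 single.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical alternative to pun(): copy the float's bytes into an integer.
// Assumes sizeof(float) == sizeof(uint32_t) and IEEE 754 binary32 encoding.
uint32_t pun_memcpy(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  // copies the object representation, no conversion
    return bits;
}
```

It drops in wherever pun() is called, e.g. pun_memcpy(num1) as the argument for a %#x conversion. Compilers normally optimize the memcpy away into a register move or a single load.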
See how it compiles on the Godbolt Compiler Explorer.
Take out the -m32 to see just how silly it is to get the data into x87 registers just for one add, in normal code that uses SSE for FP math. (esp. since they also have to be converted to double-precision for printf after scanf gives us single-precision.)
gcc ends up making some pretty inefficient-looking x87 code for 32-bit as well. It ends up having both args in regs, since it loaded them from single-precision in preparation for storing them as double. For some reason it duplicates the value on the FP stack instead of storing as double before doing the fadd.
So in this case an "f" constraint makes better code than an "m" constraint, and I don't see an easy way with AT&T syntax to specify single-precision operand-size for a memory operand without breaking the asm for register operands. (fadds %st(1) doesn't assemble, but fadd (mem) doesn't assemble either with clang. GNU as defaults to single-precision memory operands, apparently.) With Intel syntax, the operand-size modifier is attached to the memory operand, and will be there if the compiler chooses a memory operand, otherwise not.
Anyway, this sequence would be better than what gcc emits, because it avoids the fld %st(1):
call __isoc99_scanf
flds -16(%ebp)
subl $12, %esp # make even more space for args for printf beyond what was left after scanf
fstl (%esp) # (double)num1
flds -12(%ebp)
fstl 8(%esp) # (double)num2
faddp %st(1) # pops both inputs, leaving only fsum in %st(0)
fsts -28(%ebp) # store the single-precision copy
fstpl 16(%esp) # (double)fsum
pushl $.LC3
call printf
But gcc doesn't think of doing it this way, apparently. Writing the inline asm to use faddp makes gcc do an extra fld %st(1) before the faddp instead of convincing it to store the double args for printf before doing the add.
Even better would be if the single-precision stores were set up so they could be args for the type-pun printf, instead of having to be copied again for that. If writing the function by hand, I'd have scanf store results into locations that work as args for printf.