You probably want to replace this inline asm instead of even trying to fix and/or improve it. You mentioned something about "secure", so many non-data-dependent performance to avoid timing side channels? You might want to ensure branchless code.
I'd suggest using intrinsics for _mm_max_ss
(SSE1) if you want to make sure the compiler uses that. But really a ternary will reliably compile to branchless code. What is the instruction that gives branchless FP min and max on x86?
This code may be unsafe because it doesn't tell the compiler it needs two x87 stack registers, so the x87 stack can't already be full. But anyway, this is just x87 code for Pentium Pro. (using fcomi
and fcmov
conditional-move) See http://www.ray.masmcode.com/tutorial/index.html for x87 in general.
In AT&T syntax, the x87 stack registers are %st(0) .. %st(7)
, with %st
being a short name for the top of the stack.
There are four %0..%3
operands to match the 4 input constraints: %0
is the "=m"(maxval)
memory output, %3
is the maxval
memory input.
Read the manual for GNU C extended asm, and/or tutorials / guides: https://stackoverflow.com/tags/inline-assembly/info
In AT&T syntax, fld
without an l
(double-precision) or s
(single-precision) suffix means flds
: single-precision. So apparently this code is for float
. (I checked by compiling this and disassembling the machine code to see unambiguously which form of fld
we got.)
You can always put inline asm in a function and look at the compiler output to see how the compiler substitutes operands into the asm template. Using gcc -fverbose-asm
will even get GCC to put comments on those asm lines, naming the C variable that it substituted!
See How to remove "noise" from GCC/clang assembly output? for how to look at compiler asm output.
float findmax_asm(const float *arr, size_t len)
{
float maxval = -INFINITY;
for (size_t i=0 ; i<len ; i++) {
float val = arr[i];
float temp;
asm volatile("fld %2 \n"
"fld %3 \n"
"fcomi \n"
"fcmovbe %%st(1), %%st \n"
"fstp %0 \n"
"fstp %1 \n"
:"=m"(maxval), "=m"(temp)
:"m"(val), "m"(maxval));
}
return maxval;
}
On Godbolt: gcc9.1 -O3 -m32 -fverbose-asm
gives us some init code and then this main loop. -mfpmath=387
is the default for 32-bit mode.
Notice that gcc copies from the array element (%eax)
to a local val
on the stack before the inline asm. I have no idea why the optimizer can't see that the "m"(val)
input operand can just be the array element itself; the inline asm constraints tell the compiler that val
isn't modified. Fortunately that extra store-forwarding copy latency isn't part of the loop-carried latency bottleneck, but it still sucks.
.L20:
flds (%eax) # MEM[base: _18, offset: 0B]
fstps 8(%esp) # val
## inline asm starts here
fld 8(%esp) # val
fld 4(%esp) # maxval # st(0) = maxval, st(1) = val
fcomi # implicitly %st(1), %st i.e. fcomi val, maxval
fcmovbe %st(1), %st # copy if maxval<=val
fstp 4(%esp) # maxval # store updated maxval back to memory
fstp 12(%esp) # temp # and uselessly store temp = val
## and ends here
addl $4, %eax #, ivtmp.25
cmpl %edx, %eax # _25, ivtmp.25 # induction-variable temporary pointers invented by the compiler to turn arr[i] into a pointer-increment
jne .L20 #, # }while(p != endp);
# loop ends with maxval in memory.
flds 4(%esp) # maxval
.L23:
addl $16, %esp #,
ret # with %st = maxval return value
As you can see, the use of "m"
constraints forces maxval
to be stored/reloaded inside the loop, introducing that extra latency into the loop-carried dependency chain.
Much better would be using proper constraints to ask for inputs/outputs in x87 stack registers.
There's also no point having that temp
output instead of just clearing the x87 stack with fstp %st
. We can't use fcomip
because we need the input as a source for fcmov
.
But really I wouldn't suggest trying to optimize and fix this inline asm; just let the compiler emit efficient code for you.
Pure C equivalent
I used a ternary to better represent the branchless nature of this code.
maxval = (val < maxval) ? maxval : val;
compiles with gcc -m32
to the same fcomi
/ fcmovbe
with operands in the same places. This is the case even without -ffast-math
, so the NaN-handling semantics are the same.
Notice how flag-setting works for the unordered case for fcomi/fucomi
: CF=0 or ZF=0 unambiguously means that neither input was NaN so they are ordered wrt. each other.
As a bonus, it also compiles efficiently with -mfmpath=sse
(the default for 64-bit code). With that formulation of the condition, it matches the semantics of maxss (%eax), %xmm0
to load-and-max from memory in 1 instruction.
GCC's default for 32-bit code is -mfpmath=387
. Modern GCC defaults to assuming a Pentium-Pro compatible 32-bit target, but you can still use -march=skylake
if you want.
#include <stdint.h>
#include <stdlib.h>
#include <math.h>
float findmax_c(const float *arr, size_t len)
{
float maxval = -INFINITY;
for (size_t i=0 ; i<len ; i++) {
float val = arr[i];
maxval = (val < maxval) ? maxval : val; // maxss (mem), reg
}
return maxval;
}
inner loop with gcc -O3 -m32
is:
.L4:
flds (%eax) # MEM[base: _1, offset: 0B]
fxch %st(1) #
fcomi %st(1), %st #,
fcmovbe %st(1), %st #,,
fstp %st(1) #
addl $4, %eax #, ivtmp.9
cmpl %eax, %edx # ivtmp.9, _5
jne .L4 #,
ret
Note the fxch
to put the operands in the right order for fcomi
, and the fstp %st(1)
to pop val
from the x87 stack leaving maxval
at the top.
And if you compile with -mfpmath=sse
(for example in a 64-bit build), you get efficient maxss
. And with -ffast-math
, you get auto-vectorization with maxps
. (Possibly different NaN semantics from different orderings are the obstacle I think; unlike FP add/mul/sub, max
is associative.)