Understanding GCC inline assembly code that uses memory source operands and x87 fcomi / fcmov

Question

I'm reading the research paper Privado: Practical and Secure DNN Inference to hide the input-dependent branch. I am trying to understand the following GCC assembly code in that paper:

float temp;
asm volatile("fld %2 \n"
      "fld %3 \n"
      "fcomi \n"
      "fcmovbe %%st(1), %%st \n"
      "fstp %0 \n"
      "fstp %1 \n"
      :"=m"(maxval), "=m"(temp)
      :"m"(val), "m"(maxval));

I am confused about what each line of the code means. What does %%st(1),%%st mean? Why are there %0,%1,%2,%3 when there are only three variables temp, maxval and val?

I know the function of this code is similar to:

if(val>maxval)
   maxval=val;

But I don't know how the code operates internally. I want to change the code slightly into:

if(val>maxval)
    val2=maxval;

where val2 is a new variable

Just replace "=m"(maxval) with "=m"(val2) with val2 float C declaration? Because the = sign represent the output and parentheses represents that the value to be stored in the variable inside parentheses. — mdasari, Jul 21 '19 at 15:45
Was this code given to you by a professor for a class or was it part of a project you found online? — Michael Petch, Jul 21 '19 at 16:59
It is the code came from the paper Privado: Practical and Secure DNN Inference to hide the input-dependent branch. Thx mdasari btw. — idunno, Jul 21 '19 at 17:04

Peter Cordes · Answer 1 · 2019-07-21T18:15:51.403

You probably want to replace this inline asm instead of even trying to fix and/or improve it. You mentioned something about "secure", so many non-data-dependent performance to avoid timing side channels? You might want to ensure branchless code.

I'd suggest using intrinsics for _mm_max_ss (SSE1) if you want to make sure the compiler uses that. But really a ternary will reliably compile to branchless code. What is the instruction that gives branchless FP min and max on x86?

This code may be unsafe because it doesn't tell the compiler it needs two x87 stack registers, so the x87 stack can't already be full. But anyway, this is just x87 code for Pentium Pro. (using fcomi and fcmov conditional-move) See http://www.ray.masmcode.com/tutorial/index.html for x87 in general.

In AT&T syntax, the x87 stack registers are %st(0) .. %st(7), with %st being a short name for the top of the stack.

There are four %0..%3 operands to match the 4 input constraints: %0 is the "=m"(maxval) memory output, %3 is the maxval memory input.

Read the manual for GNU C extended asm, and/or tutorials / guides: https://stackoverflow.com/tags/inline-assembly/info

In AT&T syntax, fld without an l (double-precision) or s (single-precision) suffix means flds: single-precision. So apparently this code is for float. (I checked by compiling this and disassembling the machine code to see unambiguously which form of fld we got.)

You can always put inline asm in a function and look at the compiler output to see how the compiler substitutes operands into the asm template. Using gcc -fverbose-asm will even get GCC to put comments on those asm lines, naming the C variable that it substituted!

See How to remove "noise" from GCC/clang assembly output? for how to look at compiler asm output.

float findmax_asm(const float *arr, size_t len)
{
    float maxval = -INFINITY;
    for (size_t i=0 ; i<len ; i++) {
        float val = arr[i];
        float temp;
        asm volatile("fld %2 \n"
              "fld %3 \n"
              "fcomi \n"
              "fcmovbe %%st(1), %%st \n"
              "fstp %0 \n"
              "fstp %1 \n"
              :"=m"(maxval), "=m"(temp)
              :"m"(val), "m"(maxval));
    }
    return maxval;
}

On Godbolt: gcc9.1 -O3 -m32 -fverbose-asm gives us some init code and then this main loop. -mfpmath=387 is the default for 32-bit mode.

Notice that gcc copies from the array element (%eax) to a local val on the stack before the inline asm. I have no idea why the optimizer can't see that the "m"(val) input operand can just be the array element itself; the inline asm constraints tell the compiler that val isn't modified. Fortunately that extra store-forwarding copy latency isn't part of the loop-carried latency bottleneck, but it still sucks.

.L20:
        flds    (%eax)  # MEM[base: _18, offset: 0B]
        fstps   8(%esp)       # val

## inline asm starts here
        fld 8(%esp)     # val
fld 4(%esp)     # maxval               # st(0) = maxval,  st(1) = val
fcomi                                 # implicitly %st(1), %st  i.e.  fcomi val, maxval
fcmovbe %st(1), %st                   # copy if maxval<=val
fstp 4(%esp)    # maxval              # store updated maxval back to memory
fstp 12(%esp)   # temp                # and uselessly store temp = val
## and ends here

        addl    $4, %eax        #, ivtmp.25
        cmpl    %edx, %eax      # _25, ivtmp.25          # induction-variable temporary pointers invented by the compiler to turn arr[i] into a pointer-increment
        jne     .L20      #,                           # }while(p != endp);

# loop ends with maxval in memory.

        flds    4(%esp) # maxval
.L23:
        addl    $16, %esp       #,
        ret                                            # with %st = maxval return value

As you can see, the use of "m" constraints forces maxval to be stored/reloaded inside the loop, introducing that extra latency into the loop-carried dependency chain.

Much better would be using proper constraints to ask for inputs/outputs in x87 stack registers.

There's also no point having that temp output instead of just clearing the x87 stack with fstp %st. We can't use fcomip because we need the input as a source for fcmov.

But really I wouldn't suggest trying to optimize and fix this inline asm; just let the compiler emit efficient code for you.

Pure C equivalent

I used a ternary to better represent the branchless nature of this code.

maxval = (val < maxval) ? maxval : val;

compiles with gcc -m32 to the same fcomi / fcmovbe with operands in the same places. This is the case even without -ffast-math, so the NaN-handling semantics are the same.

Notice how flag-setting works for the unordered case for fcomi/fucomi: CF=0 or ZF=0 unambiguously means that neither input was NaN so they are ordered wrt. each other.

As a bonus, it also compiles efficiently with -mfmpath=sse (the default for 64-bit code). With that formulation of the condition, it matches the semantics of maxss (%eax), %xmm0 to load-and-max from memory in 1 instruction.

GCC's default for 32-bit code is -mfpmath=387. Modern GCC defaults to assuming a Pentium-Pro compatible 32-bit target, but you can still use -march=skylake if you want.

#include <stdint.h>
#include <stdlib.h>
#include <math.h>

float findmax_c(const float *arr, size_t len)
{
    float maxval = -INFINITY;
    for (size_t i=0 ; i<len ; i++) {
        float val = arr[i];
        maxval = (val < maxval) ? maxval : val;  // maxss (mem), reg
    }
    return maxval;
}

inner loop with gcc -O3 -m32 is:

.L4:
        flds    (%eax)  # MEM[base: _1, offset: 0B]
        fxch    %st(1)      #
        fcomi   %st(1), %st   #,
        fcmovbe %st(1), %st       #,,
        fstp    %st(1)      #
        addl    $4, %eax        #, ivtmp.9
        cmpl    %eax, %edx      # ivtmp.9, _5
        jne     .L4       #,
        ret

Note the fxch to put the operands in the right order for fcomi, and the fstp %st(1) to pop val from the x87 stack leaving maxval at the top.

And if you compile with -mfpmath=sse (for example in a 64-bit build), you get efficient maxss. And with -ffast-math, you get auto-vectorization with maxps. (Possibly different NaN semantics from different orderings are the obstacle I think; unlike FP add/mul/sub, max is associative.)

How can I have GCC compiling with -mfpmath=387 and a -march setting in order to emit the code maxval = (maxval <= temp) ? temp : maxval;? I am so new to these terms. And what should I do if I want to fix the above code to tell the compiler it needs two x87 stack registers? — idunno, Jul 21 '19 at 16:21
"it doesn't tell the compiler it needs two x87 stack registers, so the x87 stack can't already be full" not sure what you mean here. Gcc will generate the same kind of `fld` without checking for any stack overflow. As long as the number of push/pop matches... — Marc Glisse, Jul 21 '19 at 17:39
@MarcGlisse: But gcc knows if it was already using any x87 regs in the same function. The ABI requires the x87 stack to be empty on function boundaries (or holding a return value). But this inline asm statement can appear in any context and nothing tells the compiler not to emit it when all 8 `st` registers are already in use. (Unless inline asm implicitly requires the x87 stack to be not full? IIRC you need to declare x87 clobbers if you use any x87 regs.) — Peter Cordes, Jul 21 '19 at 17:43
Thanks, that clarifies your concern. (I don't know that much about x87) — Marc Glisse, Jul 21 '19 at 17:45
@MarcGlisse: consider yourself lucky :P A register-stack is kind of a mismatch for GCC's operand / clobber mechanism for describing the asm to the compiler. There is a special-case `"t"` constraint for the top of the stack (`%st(0)`), but IIRC it gets worse from there. — Peter Cordes, Jul 21 '19 at 17:48
@idunno The best way to fix this inline assembly is to get rid of it. Replace it with `maxval = (val < maxval) ? maxval : val;` as Peter Cordes said. — fuz, Jul 21 '19 at 19:23

Understanding GCC inline assembly code that uses memory source operands and x87 fcomi / fcmov

1 Answers1

Pure C equivalent

Linked