
I have a few routines here that all do the same thing: they clamp a float to the range [0,65535]. What surprises me is that the compiler (gcc -O3) uses three, count 'em, three different ways to implement float-min and float-max. I'd like to understand why it generates three different implementations. Ok, here's the C++ code:

#include <algorithm>  // for std::min / std::max below

float clamp1(float x) {
    x = (x < 0.0f) ? 0.0f : x;
    x = (x > 65535.0f) ? 65535.0f : x;
    return x;
}

float clamp2(float x) {
    x = std::max(0.0f, x);
    x = std::min(65535.0f, x);
    return x;
}

float clamp3(float x) {
    x = std::min(65535.0f, x);
    x = std::max(0.0f, x);
    return x;
}

So here's the generated assembly (with some of the boilerplate removed). Reproducible on https://godbolt.org/z/db775on4j with GCC10.3 -O3. (Also showing clang14's choices.)

CLAMP1:
    movaps  %xmm0, %xmm1
    pxor    %xmm0, %xmm0
    comiss  %xmm1, %xmm0
    ja  .L9
    movss   .LC1(%rip), %xmm0     # 65535.0f
    movaps  %xmm0, %xmm2
    cmpltss %xmm1, %xmm2
    andps   %xmm2, %xmm0
    andnps  %xmm1, %xmm2
    orps    %xmm2, %xmm0
.L9:
    ret
CLAMP2:
    pxor    %xmm1, %xmm1
    comiss  %xmm1, %xmm0
    ja  .L20
    pxor    %xmm0, %xmm0
    ret
.L20:
    minss   .LC1(%rip), %xmm0      # 65535.0f
    ret
CLAMP3:
    movaps  %xmm0, %xmm1
    movss   .LC1(%rip), %xmm0      # 65535.0f
    comiss  %xmm1, %xmm0
    ja  .L28
    ret
.L28:
    maxss   .LC2(%rip), %xmm1      # 0.0f
    movaps  %xmm1, %xmm0
    ret

So there appear to be three different implementations of MIN and MAX here:

  • using compare-and-branch
  • using minss and maxss
  • using compare, andps, andnps, and orps.

Can somebody clarify the following:

  • Are these all the same speed, or is one of them faster?
  • How does the compiler end up choosing all these different implementations?
  • What exactly is that thing with the andps, andnps, and so forth?
  • Would using both minss and maxss, and no branches, be faster?
Sep Roland
jyelon
    Because the compiler is dumb. – xiver77 Jun 25 '22 at 02:24
  • See also [What is the instruction that gives branchless FP min and max on x86?](https://stackoverflow.com/q/40196817) re: the exact semantics of MAXSS / MINSS. IDK why GCC would choose to branch vs. blend, and if it was going to blend it would clearly be cheaper to blend with 0.0, since that just take an AND or ANDN to conditionally zero. – Peter Cordes Jun 25 '22 at 02:27
  • 4
    I love these questions where people are looking for a deeper reason behind the compilers seemingly sub-optimal code generation. It's 50/50 just a missed optimisation or some actual subtle, easily missed cause. – fuz Jun 25 '22 at 02:35
  • clang chooses minss / maxss for all 3 versions, no branching. Unless the branch is highly predictable, that's very good. GCC's cmp/blend is just silly. Go report this on https://gcc.gnu.org/bugzilla/enter_bug.cgi with the keyword `missed-optimization`. (The andn/and/or is emulating `blendvps` without using SSE4.1 instructions. If you compile with `-march=haswell -mno-avx` or something, you'll see simpler asm, but still not *good* asm.) – Peter Cordes Jun 25 '22 at 02:36
  • 1
    With `-ffast-math`, GCC does optimize them to minss/maxss. I guess it's getting crossed up by the NaN semantics of the source vs. the asm instructions, not figuring out how to reverse the operands to achieve the strict-FP semantics of the source in all cases. https://godbolt.org/z/oa7v6vK8M I think clang is doing it right. – Peter Cordes Jun 25 '22 at 02:44
  • peter cordes: thanks for the edits and the observations! Those are helpful. fuz: I think you may be right. – jyelon Jun 25 '22 at 02:45
  • @fuz I think that's because of how common it is to hear things to the effect of "the compiler is better at optimizing than you are, so let it do its job". When people who heard that then see the compiler do something weird, they assume there's a good reason for it that they just don't know. – Joseph Sible-Reinstate Monica Jun 25 '22 at 02:49
  • I've come to the conclusion that using maxss and minss is the fastest algorithm. I can't find *any* way to get gcc to generate that sequence, no matter what I try, unless I turn on -fast-math for the entire compilation unit (no way, man.) – jyelon Jun 25 '22 at 03:57
  • OK, I managed to get gcc to generate the "right" code: https://godbolt.org/z/Ej91d65WG – jyelon Jun 25 '22 at 04:34
  • @jyelon That's not likely to be the "right" code. You made the compiler load `0` from memory, which can be produced with almost zero-cost in a register. If you *really* need to produce the assembly instruction you want, inline assembly is your only option. Some people say don't use it, but (GCC's) inline assembly is a truly well-made and useful tool. Also, if the current answer helped, consider clicking ✔️ :) – xiver77 Jun 25 '22 at 04:56
  • Well, using the constant 0.0 produces about the same code: https://godbolt.org/z/c41r9P3jb I actually think peter's comment is the most helpful answer: "I guess it's getting crossed up by the NaN semantics of the source vs. the asm instructions." That was the flash of insight for me. – jyelon Jun 25 '22 at 05:11
  • @jyelon You can report that as a missed optimization. – xiver77 Jun 25 '22 at 06:05
  • @xiver77 Also consider using intrinsics in this specific case. – fuz Jun 25 '22 at 19:00

1 Answer


Are these all the same speed, or is one of them faster?

They are not all the same speed; which one is faster depends on the machine and on the input.

How does the compiler end up choosing all these different implementations?

Because the compiler is not as smart as you think.

What exactly is that thing with the andps, andnps, and so forth?

It implements the following blend logic:

float y = 65535;
float m = std::bit_cast<float>(-(x < y));
return x & m | y & ~m;

You can't actually apply `&` to floats in C++ (or C), but you get the idea.

Would using both minss and maxss, and no branches, be faster?

Usually, yes. You can look up the latency and throughput of each instruction at https://uops.info/table.html. I'd write the following in assembly:

xorps xmm1, xmm1
minss xmm0, [_ffff]
maxss xmm0, xmm1

However, if you know that a certain branch is very likely to be taken, jumping over the work with a branch can be faster. Let's see what the compilers do with branch hints.

float clamp_minmax(float x) {
  return std::max(std::min(x, 65535.0f), 0.0f);
}

float clamp_hint(float x) {
  if (__builtin_expect(x > 65535, 0)) {
    return 65535;
  }
  if (__builtin_expect(x < 0, 0)) {
    return 0;
  }
  return x;
}

GCC basically doesn't care, but Clang interestingly produces different code.

clamp_minmax(float):
        movss   xmm1, dword ptr [rip + .LCPI0_0]
        minss   xmm1, xmm0
        xorps   xmm0, xmm0
        maxss   xmm0, xmm1
        ret

clamp_hint(float):
        movaps  xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI1_0]
        ucomiss xmm1, xmm0
        ja      .LBB1_2
        xorps   xmm0, xmm0
        maxss   xmm0, xmm1
.LBB1_2:
        ret

ucomiss + ja is not faster than a single minss, but if the branch lets you skip maxss most of the time, the branchy version will probably run faster than the branchless one: you avoid the latency of maxss, which in the branchless version is unavoidable because it depends on the result of the preceding minss.

See also [What is the instruction that gives branchless FP min and max on x86?](https://stackoverflow.com/q/40196817)

xiver77
  • 1
    The `ucomiss/ja` pair is faster if the branch is easy to predict as it avoids a data dependency on the lower of the two values. – fuz Jun 25 '22 at 19:01
  • 1
    @fuz I edited my answer after reading "MINSS and MAXSS are faster than anything you could do with a branch anyway" ([1](https://stackoverflow.com/a/40199125/17665807)). – xiver77 Jun 25 '22 at 19:05
  • @fuz Never mind, you're correct. I edited my answer again.. – xiver77 Jun 25 '22 at 19:35
  • MINSS/MAXSS are great for throughput, but latency bottlenecks can be a problem in some code if they're part of a long dep chain that OoO exec can't (fully) hide. That statement in my answer of "faster than anything you could do with a branch" is assuming a branch would be somewhat unpredictable, and/or mostly considering throughput. – Peter Cordes Jun 25 '22 at 23:28
  • This is really interesting. – jyelon Jun 26 '22 at 18:32