I have a few routines here that all do the same thing: they clamp a float to the range [0,65535]. What surprises me is that the compiler (gcc -O3) uses three, count 'em, three different ways to implement float-min and float-max. I'd like to understand why it generates three different implementations. Ok, here's the C++ code:
#include <algorithm>  // std::min, std::max

float clamp1(float x) {
    x = (x < 0.0f) ? 0.0f : x;
    x = (x > 65535.0f) ? 65535.0f : x;
    return x;
}
float clamp2(float x) {
    x = std::max(0.0f, x);
    x = std::min(65535.0f, x);
    return x;
}
float clamp3(float x) {
    x = std::min(65535.0f, x);
    x = std::max(0.0f, x);
    return x;
}
So here's the generated assembly (with some of the boilerplate removed). Reproducible on https://godbolt.org/z/db775on4j with GCC 10.3 -O3. (Also showing clang 14's choices.)
CLAMP1:
movaps %xmm0, %xmm1
pxor %xmm0, %xmm0
comiss %xmm1, %xmm0
ja .L9
movss .LC1(%rip), %xmm0 # 65535.0f
movaps %xmm0, %xmm2
cmpltss %xmm1, %xmm2
andps %xmm2, %xmm0
andnps %xmm1, %xmm2
orps %xmm2, %xmm0
.L9:
ret
CLAMP2:
pxor %xmm1, %xmm1
comiss %xmm1, %xmm0
ja .L20
pxor %xmm0, %xmm0
ret
.L20:
minss .LC1(%rip), %xmm0 # 65535.0f
ret
CLAMP3:
movaps %xmm0, %xmm1
movss .LC1(%rip), %xmm0 # 65535.0f
comiss %xmm1, %xmm0
ja .L28
ret
.L28:
maxss .LC2(%rip), %xmm1 # 0.0f
movaps %xmm1, %xmm0
ret
So there appear to be three different implementations of MIN and MAX here:
- using compare-and-branch
- using minss and maxss
- using compare, andps, andnps, and orps.
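To make the third pattern concrete, here is my reading of the tail end of CLAMP1, transcribed instruction by instruction into SSE intrinsics. This is only a sketch: the function name select_upper is made up for illustration, it assumes an x86 target with <xmmintrin.h> available, and it covers just the upper-bound half of the clamp.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: mirrors the cmpltss / andps / andnps / orps
// sequence from CLAMP1, i.e. (x > 65535.0f) ? 65535.0f : x.
float select_upper(float x) {
    __m128 vx    = _mm_set_ss(x);         // x in the low lane, upper lanes zero
    __m128 limit = _mm_set_ss(65535.0f);  // the constant loaded from .LC1

    // cmpltss: low lane becomes all-ones if 65535 < x, else all-zeros.
    __m128 mask = _mm_cmplt_ss(limit, vx);

    __m128 keep_limit = _mm_and_ps(mask, limit);  // andps:  mask & 65535
    __m128 keep_x     = _mm_andnot_ps(mask, vx);  // andnps: ~mask & x
    __m128 result     = _mm_or_ps(keep_limit, keep_x);  // orps: merge the two

    return _mm_cvtss_f32(result);  // extract the low lane
}
```

Called as select_upper(70000.0f) this yields 65535.0f, and select_upper(12.5f) passes 12.5f through unchanged, so it looks like a bitwise select between the two values driven by the compare mask.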
Can somebody clarify the following:
- Are these all the same speed, or is one of them faster?
- How does the compiler end up choosing all these different implementations?
- What exactly is that thing with the andps, andnps, and so forth?
- Would using both minss and maxss, and no branches, be faster?
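For the last question, the branch-free version I have in mind would look roughly like the sketch below, written with SSE intrinsics so that minss and maxss are requested explicitly. The name clamp_minmax is hypothetical, and this assumes an x86 target. It is not a drop-in equivalent of clamp1 through clamp3: maxss and minss return their second operand when either input is NaN, so a NaN input comes out as 0.0f here, and -0.0f comes out as +0.0f.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical branch-free clamp to [0, 65535] using maxss then minss.
float clamp_minmax(float x) {
    __m128 v = _mm_set_ss(x);
    v = _mm_max_ss(v, _mm_set_ss(0.0f));      // maxss: lower bound
    v = _mm_min_ss(v, _mm_set_ss(65535.0f));  // minss: upper bound
    return _mm_cvtss_f32(v);                  // extract the low lane
}
```

I have not measured this against the three compiler-generated versions above, so whether it is actually faster is exactly what I am asking.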