If it's ok for sgn(-0.0f) to produce an output of -0.0f instead of +0.0f, you can save an instruction or two compared to @Cory Nelson's version. See below for a version which also propagates NaN.
- select 0.0 or 1.0 based on a compare for x != 0.0f
- copy the sign bit of x onto that.
// return -0.0 for x=-0.0, otherwise the same as Cory's (except for NaN which neither handle well)
__m128 sgn_fast(__m128 x)
{
    __m128 negzero = _mm_set1_ps(-0.0f);
    // using _mm_setzero_ps() here might actually be better without AVX, since xor-zeroing is as cheap as a copy but starts a new dependency chain
    //__m128 nonzero = _mm_cmpneq_ps(x, negzero); // -0.0 == 0.0 in IEEE floating point
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
    __m128 x_signbit = _mm_and_ps(x, negzero);
    __m128 zeroone = _mm_and_ps(nonzero, _mm_set1_ps(1.0f));
    return _mm_or_ps(zeroone, x_signbit);
}
When the input is NaN, I think it returns +/-1.0f, according to the sign of the NaN. (Since _mm_cmpneq_ps() is true when x is NaN: see the table on the CMPPD instruction).
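If you want to sanity-check the -0.0f and NaN behaviour, a quick test harness like the following works. This is just an illustration (the test values and the printing are my own, and it assumes sgn_fast() from above is in the same file):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

// assumes sgn_fast() from above is defined earlier in this file
int main(void)
{
    float invals[4] = { -0.0f, 0.0f, NAN, -3.5f };
    float out[4];
    _mm_storeu_ps(out, sgn_fast(_mm_loadu_ps(invals)));
    for (int i = 0; i < 4; i++)
        printf("sgn_fast(%g) = %g\n", invals[i], out[i]);
    // expected: -0, 0, +/-1 (according to the NaN's sign bit), -1
    return 0;
}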
Without AVX, this is two fewer instructions than Cory's version (with clang3.9 on the Godbolt compiler explorer). When inlined into a loop, the memory source operands could be register source operands. gcc uses more instructions, doing a separate MOVAPS load and painting itself into a corner that requires an extra MOVAPS to get the return value into xmm0.
xorps xmm1, xmm1
cmpneqps xmm1, xmm0
andps xmm0, xmmword ptr [rip + .LCPI0_0] # x_signbit
andps xmm1, xmmword ptr [rip + .LCPI0_1] # zeroone
orps xmm0, xmm1
The critical-path latency is cmpneqps + andps + orps, which is 3+1+1 cycles on Intel Haswell, for example. Cory's version needs to run two cmpps instructions in parallel to achieve that latency, which is only possible on Skylake. Other CPUs will have a resource conflict causing an extra cycle of latency.
To propagate NaN, so the possible outputs would be -1.0f, -/+0.0f, 1.0f, and NaN, we could take advantage of the fact that the all-ones bit pattern is a NaN. Use _mm_cmpunord_ps(x,x) to get a NaN mask (or, equivalently, cmpneqps with x as both operands), and OR that onto the result to leave it unmodified or force it to NaN.
// return -0.0 for x=-0.0. Return -NaN for any NaN
__m128 sgn_fast_nanpropagating(__m128 x)
{
    __m128 negzero = _mm_set1_ps(-0.0f);
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
    __m128 x_signbit = _mm_and_ps(x, negzero);
    __m128 nanmask = _mm_cmpunord_ps(x,x);
    __m128 x_sign_or_nan = _mm_or_ps(x_signbit, nanmask); // apply it here instead of to the final result for better ILP
    __m128 zeroone = _mm_and_ps(nonzero, _mm_set1_ps(1.0f));
    return _mm_or_ps(zeroone, x_sign_or_nan);
}
This compiles efficiently, and barely lengthens the critical path latency. It does take more MOVAPS instructions to copy registers without AVX, though.
You might be able to do something useful with SSE4.1 BLENDVPS, but it's not the most efficient instruction on all CPUs. It's also hard to avoid treating negative zero as non-zero.
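For the record, here's roughly what a BLENDVPS version could look like. This is only a sketch of one possible approach (my own, not necessarily the variant hinted at above): select +-1.0f on the sign bit of x, then AND with a nonzero mask. Note that the extra cmpneqps/andps is exactly what's needed to stop -0.0f being treated as non-zero, so it doesn't save anything, and it returns +0.0f (not -0.0f) for an input of -0.0f:

#include <smmintrin.h>   // SSE4.1 for _mm_blendv_ps

// sketch only: +-1.0 selected on the sign bit of x, then zeroed if x == +-0.0
__m128 sgn_blendv_sketch(__m128 x)
{
    __m128 pm1 = _mm_blendv_ps(_mm_set1_ps(1.0f), _mm_set1_ps(-1.0f), x); // sign bit of x selects -1.0
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());                  // false for +-0.0, true for NaN
    return _mm_and_ps(pm1, nonzero);   // zero out the +-1.0 when x was +-0.0
}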
If you want an integer result, you can use SSSE3 _mm_sign_epi32(set1(1), x) to get a -1, 0, or 1 output. If -0.0f -> -1 is too sloppy, you can fix that up by ANDing with the result of _mm_cmpneq_ps(x, _mm_setzero_ps()).
// returns -1 for x = -0.0f
__m128i sgn_verysloppy_int_ssse3(__m128 x) {
    __m128i one = _mm_set1_epi32(1);
    __m128i sign = _mm_sign_epi32(one, _mm_castps_si128(x));
    return sign;
}

// correct results for all inputs
// NaN -> -1 or 1 according to its sign bit, never 0
__m128i sgn_int_ssse3(__m128 x) {
    __m128i one = _mm_set1_epi32(1);
    __m128i sign = _mm_sign_epi32(one, _mm_castps_si128(x));
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
    return _mm_and_si128(sign, _mm_castps_si128(nonzero));
}
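Usage is the same as for the float versions. For example (the main and the test values are just for illustration, assuming the two functions above are in the same file):

#include <tmmintrin.h>   // SSSE3
#include <stdint.h>
#include <math.h>
#include <stdio.h>

// assumes sgn_verysloppy_int_ssse3() and sgn_int_ssse3() from above
int main(void)
{
    float invals[4] = { -0.0f, 2.5f, NAN, -3.5f };
    __m128 in = _mm_loadu_ps(invals);

    int32_t sloppy[4], correct[4];
    _mm_storeu_si128((__m128i*)sloppy, sgn_verysloppy_int_ssse3(in));
    _mm_storeu_si128((__m128i*)correct, sgn_int_ssse3(in));

    for (int i = 0; i < 4; i++)
        printf("x=%g  sloppy=%d  correct=%d\n", invals[i], (int)sloppy[i], (int)correct[i]);
    // the -0.0f element is where the two versions differ: -1 vs. 0
    return 0;
}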