SSE doesn't naturally or efficiently work this way for float/double. What exactly do you want to do with the -1.0f / 0.0f / 1.0f sgn(x) value?
You should probably optimize out the step of actually having those FP values in a register, and work directly with compare mask results. The question you're asking is a sign of an X-Y problem. Yes you could actually implement this, but usually you shouldn't.
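If you really do need the -1.0f / 0.0f / 1.0f values in a register, a minimal sketch could look like the following. The name signum_ps is made up for illustration, and the NaN behaviour (a NaN input comes out as +1.0f or -1.0f according to its sign bit) is just what falls out of this particular choice of compare, not a recommendation.

```c
#include <xmmintrin.h>  // SSE: _mm_and_ps, _mm_or_ps, _mm_cmpneq_ps, ...

// Hypothetical helper: per-element -1.0f / 0.0f / +1.0f.
// Both +0.0 and -0.0 give +0.0. NaN inputs give +-1.0f, because
// cmpneq is true when either operand is NaN.
static inline __m128 signum_ps(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);           // only the sign bit set
    const __m128 one       = _mm_set1_ps(1.0f);
    __m128 sign     = _mm_and_ps(x, sign_mask);            // isolate each sign bit
    __m128 signed_1 = _mm_or_ps(one, sign);                // copysign(1.0f, x)
    __m128 nonzero  = _mm_cmpneq_ps(x, _mm_setzero_ps());  // false for +0.0 and -0.0
    return _mm_and_ps(signed_1, nonzero);                  // zero the elements where x == 0
}
```

That is four vector operations plus constants just to materialize a value you will probably feed straight into a multiply, which is why working with the masks directly is usually the better plan.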
For example, instead of multiplying by sgn(x), you could boolean AND or compare+AND to get a mask of the sign bits, and then maybe boolean XOR (_mm_xor_ps()) to flip the sign bits in another vector where those bits were set, leaving unchanged the elements where the sign bit was unset in the corresponding element.
(FP negation is as simple as flipping the sign bit, because IEEE-754 binary formats use a sign/magnitude representation.)
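As a small illustration of that sign-bit trick, here is a sketch of negating a whole vector with one XOR. The helper name negate_ps is hypothetical.

```c
#include <xmmintrin.h>  // SSE: _mm_xor_ps, _mm_set1_ps

// Hypothetical helper: negate all four floats by flipping their sign bits.
// -0.0f has the bit pattern 0x80000000, i.e. only the sign bit is set.
static inline __m128 negate_ps(__m128 v)
{
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}
```

This flips the sign of every element, including zeros and NaNs, without touching the exponent or mantissa bits.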
But be careful of -0.0, because it has the sign bit set. If you want to zero elements based on the corresponding element being zero, and flip or not for others, you could use a couple of boolean operations and then mask the result using a compare against 0.0: either _mm_cmpneq_ps with _mm_and_ps, or _mm_cmpeq_ps with _mm_andnot_ps. (The compare treats 0.0 and -0.0 as equal.) Or compare against -0.0f, if you already have that constant for something else.
For example:
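#include <immintrin.h>

// SSE2 v * sgn(src). Note that a NaN in src is NOT treated as zero here:
// cmpneq is true for NaN, so v just gets the NaN's sign bit XORed in.
__m128 mul_by_signum(__m128 v, __m128 src)
{
    __m128 minus_zero = _mm_set1_ps(-0.0f);          // same bits as epi32(1U<<31)
    __m128 signbits = _mm_and_ps(src, minus_zero);   // isolate the sign bits of src
    __m128 flipped = _mm_xor_ps(v, signbits);        // negate v where src is negative
    // reuse the zero constant we already have, maybe saving an instruction
    __m128 nonzero = _mm_cmpneq_ps(src, minus_zero); // false for src = +0.0 or -0.0
    return _mm_and_ps(flipped, nonzero);             // zero the result where src == 0
}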
Comparing against minus_zero instead of _mm_setzero_ps() lets the compiler reuse the same constant, saving an instruction, at least if AVX is enabled. Otherwise it needs an extra movaps to copy a register instead of an xorps zeroing instruction. (Godbolt). Clang compares against +0.0 instead of the constant it loaded, even if that costs an extra instruction without saving any. (It does mean the compare doesn't have to wait for load latency from the constant.)
For integer, there's SSSE3 psignb/w/d, which will preserve / zero / negate elements in the destination based on the source being positive / zero / negative. With a destination of _mm_set1_epi32(1), it would give you a vector of 1/0/-1 elements.
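A minimal sketch of that integer signum using the psignd intrinsic, _mm_sign_epi32. The helper name signum_epi32 is made up, and the target attribute is a GCC/Clang extension used here only so the snippet builds without -mssse3. Drop it if you already compile with SSSE3 enabled.

```c
#include <tmmintrin.h>  // SSSE3: _mm_sign_epi32 (also pulls in the SSE2 header)

// Hypothetical helper: per-element 1 / 0 / -1 for 32-bit signed integers.
// psignd keeps, zeroes, or negates each element of its first operand
// according to the sign of the matching element of the second operand.
__attribute__((target("ssse3")))  // GCC/Clang only; assumption, see above
static inline __m128i signum_epi32(__m128i x)
{
    return _mm_sign_epi32(_mm_set1_epi32(1), x);
}
```

The byte and word versions (_mm_sign_epi8 and _mm_sign_epi16) work the same way on narrower elements.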
You can't usefully use it on FP data, because FP uses sign/magnitude instead of 2's complement, and because it checks for integer zero, so -0.0 would look like a negative number.
BTW, you didn't mention what you want to happen for NaN FP inputs. Don't forget that FP comparisons have 4 possible results: above/equal/below, or unordered if one or both operands are NaN. (So for comparison against zero, you can have positive, zero, negative, or NaN.)
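To make the four outcomes concrete, here is a sketch that classifies each element against zero, plus one possible variant of the sign-multiply idea that uses only ordered compares so that a NaN in src zeroes the result. The names sign_classes, classify_vs_zero and mul_by_signum_nan_as_zero are hypothetical, and treating NaN like zero is just one policy you might pick.

```c
#include <xmmintrin.h>  // SSE compares, boolean ops, _mm_movemask_ps

// Each field is a 4-bit _mm_movemask_ps result: bit i is set when
// element i of the input falls into that category.
typedef struct { int negative, zero, positive, nan; } sign_classes;

static inline sign_classes classify_vs_zero(__m128 x)
{
    const __m128 zero = _mm_setzero_ps();
    sign_classes c;
    c.negative = _mm_movemask_ps(_mm_cmplt_ps(x, zero));     // x < 0
    c.zero     = _mm_movemask_ps(_mm_cmpeq_ps(x, zero));     // x == 0, including -0.0
    c.positive = _mm_movemask_ps(_mm_cmpgt_ps(x, zero));     // x > 0
    c.nan      = _mm_movemask_ps(_mm_cmpunord_ps(x, zero));  // unordered: x is NaN
    return c;
}

// v * sgn(src), with a NaN in src treated like src == 0.
static inline __m128 mul_by_signum_nan_as_zero(__m128 v, __m128 src)
{
    const __m128 minus_zero = _mm_set1_ps(-0.0f);
    __m128 flipped = _mm_xor_ps(v, _mm_and_ps(src, minus_zero));
    // (src < 0) | (src > 0): both compares are false for NaN and for +-0.0
    __m128 nonzero = _mm_or_ps(_mm_cmplt_ps(src, minus_zero),
                               _mm_cmpgt_ps(src, minus_zero));
    return _mm_and_ps(flipped, nonzero);
}
```

Exactly one of the four masks is set for any given element, so whichever NaN policy you want comes down to which of these masks you AND or OR into the final result.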