How do I get the sign of an Intel Architecture SIMD __m128?

Question

Since "_mm_sign_ps" does not exist as far as I have been able to find: Given an __m128 value with four floating-point values, which SIMD instruction or list of SIMD instructions would convert it to an __m128 value with four floating-point values containing either:

+1, if that original value of the four is positive and greater than zero.
0, if that original value of the four is zero.
-1, if that original value of the four is negative and less than zero.

score 3 · Answer 1 · answered Jan 21 '18 at 04:54

SSE really doesn't match this very well at all. First, the comparison functions don't result in ±1.0f, but rather all bits being set if the condition is true, or none of them set if the condition is false. Also, you're asking for a three-way comparison where the result is "zero" if the value is "zero" ("zero" is in quotes because you don't actually specify whether you want positive or negative zero; IEEE 754 has both). You'll be much better off if you can re-frame the problem to better match what SSE provides.

That said:

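#include <xmmintrin.h>  // SSE intrinsics

__m128 foo (__m128 value) {
  const __m128 zero = _mm_set_ps1 (0.0f);

  // Each compare yields an all-ones or all-zeros mask per element;
  // ANDing with a constant turns "true" into that constant and "false" into 0.0f.
  __m128 positives = _mm_and_ps(_mm_cmpgt_ps (value, zero), _mm_set_ps1(1.0f));
  __m128 negatives = _mm_and_ps(_mm_cmplt_ps (value, zero), _mm_set_ps1(-1.0f));

  // At most one of the two is nonzero per element; +0.0, -0.0 and NaN give 0.0f.
  return _mm_or_ps(positives, negatives);
}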

I don't know what you're planning on using this for, but if you're comfortable with bitwise operations then there is a good chance you can figure out how to just use a single _mm_cmpgt_ps, _mm_cmpge_ps, _mm_cmplt_ps, or _mm_cmple_ps.
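As one illustration of working with a compare mask directly, here is a minimal sketch that zeroes every element that is not strictly positive, using a single _mm_cmpgt_ps and no ±1.0f vector in between. The name keep_positive is made up for this example, and whether this shape fits depends on what the sign is needed for.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: keep the elements of value that are > 0 and
   zero the rest. The compare yields all-ones (true) or all-zeros (false)
   per element, so ANDing it with the data is a per-element select-or-zero.
   NaN elements compare false and therefore come out as 0.0f. */
static __m128 keep_positive(__m128 value) {
  const __m128 zero = _mm_setzero_ps();
  __m128 mask = _mm_cmpgt_ps(value, zero);  /* 0xffffffff where value > 0 */
  return _mm_and_ps(value, mask);
}
```

The mask never has to be converted to a floating-point 1.0f; it is consumed as a bit pattern by the AND.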

In order to divide the number line into three equivalence classes, a minimum of two comparisons will be needed. — Ben Voigt, Jan 21 '18 at 05:16
Yes, that's why I have two comparisons in the answer. However, it's not usually necessary to treat zero differently, so if you're careful about how you approach the problem it may be possible to simplify everything down to a single comparison, possibly even one where you don't need to convert 0xffffffff to ±1.0f. — nemequ, Jan 21 '18 at 05:20
My guess was they might want to multiply something by this, (conditional sign flip or zero), in which case it only takes a few booleans and one cmpeq_ps. — Peter Cordes, Jan 21 '18 at 08:08

Peter Cordes · Answer 2 · 2022-11-15T06:04:46.533

SSE doesn't naturally / efficiently work this way for float/double. What exactly do you want to do with the -1.0f / 0.0f / 1.0f sgn(x) value?

You should probably optimize out the step of actually having those FP values in a register, and work directly with compare mask results. The question you're asking is a sign of an X-Y problem. Yes you could actually implement this, but usually you shouldn't.

For example, you could boolean AND or compare+AND to get a mask of the sign bits, and then maybe boolean XOR (_mm_xor_ps()) to flip the sign bits in another vector where those bits were set, and to leave unchanged the elements where the sign bit was unset in the corresponding element.

(FP negation is as simple as flipping the sign bit, because IEEE-754 binary formats use a sign/magnitude representation.)

But be careful of -0.0, because it has the sign bit set. If you want to zero elements based on the corresponding element being zero, and flip or not for others, you could use a couple boolean operations and then mask the result with the result of _mm_cmpeq_ps against 0.0. (Which will be true for 0.0 and -0.0). Or compare against -0.0f, if you already have that constant for something else.

For example:

// SSE2  v * sgn(src), except that a NaN in src counts as nonzero:
// _mm_cmpneq_ps is true for NaN, so v just gets the NaN's sign bit applied

__m128 mul_by_signum(__m128 v, __m128 src)
{
    __m128 minus_zero = _mm_set1_ps(-0.0f);  // epi32(1U<<31)
    __m128 signbits = _mm_and_ps(src, minus_zero);
    __m128 flipped = _mm_xor_ps(v, signbits);

    // reuse the zero constant we already have, maybe saving an instruction
    __m128 nonzero = _mm_cmpneq_ps(src, minus_zero);
    return _mm_and_ps(flipped, nonzero);
}

Comparing against minus_zero instead of _mm_setzero_ps() lets the compiler reuse the same constant, saving an instruction. At least if AVX is enabled, otherwise it needs an extra movaps to copy a register instead of an xorps zeroing instruction. (Godbolt). Clang compares against +0.0 instead of the constant it loaded, even if that costs an extra instruction without saving any. (It does mean the compare doesn't have to wait for load latency from the constant.)
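If the -1.0f / 0.0f / +1.0f vector itself is needed after all, the same bitwise idea seems to carry over with v fixed to 1.0f. The following is only a sketch under that assumption; signum_ps is a hypothetical name, and NaN inputs are deliberately not special-cased.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: per-element sgn(x) as -1.0f, 0.0f or +1.0f.
   Assumption: NaN is not special-cased. It compares not-equal to zero,
   so it yields +1.0f or -1.0f according to its sign bit. */
static __m128 signum_ps(__m128 x) {
  const __m128 minus_zero = _mm_set1_ps(-0.0f);  /* only the sign bit set */
  const __m128 one = _mm_set1_ps(1.0f);
  __m128 signbits = _mm_and_ps(x, minus_zero);    /* sign bit of each element */
  __m128 signed_one = _mm_or_ps(one, signbits);   /* +1.0f or -1.0f */
  __m128 nonzero = _mm_cmpneq_ps(x, minus_zero);  /* false for +0.0 and -0.0 */
  return _mm_and_ps(signed_one, nonzero);         /* zero inputs become +0.0f */
}
```

This costs one compare and three boolean operations, versus two compares and three boolean operations for the two-comparison version, but it still materializes a value that is usually better left as a mask.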

For integer, there's SSSE3 psignb/w/d, which will preserve / zero / negate elements in the destination based on the source being positive / zero / negative. With a destination of _mm_set1_epi32(1), it would give you a vector of 1/0/-1 elements.
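A minimal sketch of that integer case, using the _mm_sign_epi32 intrinsic for psignd. The names signum_epi32 and SSSE3_FN are made up here, and the target attribute is an assumption about GCC/Clang builds that do not pass -mssse3 on the command line.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics (_mm_sign_epi32) */

/* Assumption: GCC and Clang need SSSE3 enabled for this function, either
   with -mssse3 or with the target attribute below. Other compilers are
   assumed not to need it. */
#if defined(__GNUC__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Hypothetical helper: per-element integer signum, giving 1, 0 or -1. */
static SSSE3_FN __m128i signum_epi32(__m128i x) {
  /* psignd keeps, zeroes or negates each 1 according to the sign of x */
  return _mm_sign_epi32(_mm_set1_epi32(1), x);
}
```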

You can't usefully use it on FP data, because FP uses sign/magnitude instead of 2's complement, and because it checks for integer zero, so -0.0 would look like a negative number.

BTW, you didn't mention what you want to happen for NaN FP inputs. Don't forget that FP comparisons have 4 possible results: above/equal/below, or unordered if one or both operands are NaN. (So for comparison against zero, you can have positive, zero, negative, or NaN).
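To make the fourth outcome concrete, here is a small sketch that reports which elements are unordered (NaN) and which fall into one of the three ordered classes. Both function names are hypothetical; the intrinsics are plain SSE.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: bitmask (bit i = element i) of NaN elements.
   _mm_cmpunord_ps is true where either operand is NaN, so comparing
   x with itself flags exactly the NaN elements. */
static int nan_lanes(__m128 x) {
  return _mm_movemask_ps(_mm_cmpunord_ps(x, x));
}

/* Hypothetical helper: bitmask of elements that are positive, zero or
   negative. All three compares are false for NaN, so NaN elements are
   missing from the result. */
static int ordered_lanes(__m128 x) {
  const __m128 zero = _mm_setzero_ps();
  __m128 gt = _mm_cmpgt_ps(x, zero);
  __m128 lt = _mm_cmplt_ps(x, zero);
  __m128 eq = _mm_cmpeq_ps(x, zero);  /* true for +0.0 and -0.0 */
  return _mm_movemask_ps(_mm_or_ps(_mm_or_ps(gt, lt), eq));
}
```

For any input the two bitmasks are complementary within the low four bits, which is one way to see that a sign function built only from greater-than, less-than and equal-to compares silently maps NaN to whatever the "none of the above" case produces.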

Unfortunately, you're right about the X-Y problem. What I actually want to do is normalize a set of SOA vectors using SIMD, some of which start at [0, 0, 0], but ending without any +/-Inf or NaN results. The sign question was just part of the convoluted solution I came up with. Should I edit this question, or make a new one? — Narf the Mouse, Jan 22 '18 at 10:26
@NarftheMouse: Oh, that's relatively easy. `nonzero_magnitude = _mm_cmpneq_ps(sum_of_squares, _mm_setzero_ps());`, then use that to mask the division (or `rsqrtps`) result to 0.0 for the elements that produce 0 / 0. So it just takes one extra `cmpps` and one extra `andps` to zero out elements where the inputs produce an Inf or NaN. As a bonus, comparing with an early temporary result instead of checking the final result for NaN takes the compare off the latency critical path. I think that's been asked before (at least something with that general idea), but sure, ask a new question if you can't find it. — Peter Cordes, Jan 22 '18 at 11:04
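The masking idea from the comment above might look roughly like this for four 3D vectors in SoA layout. This is a sketch, not the commenter's code: normalize_soa is a hypothetical name, the inputs are assumed to be finite with squares that do not overflow, and an exact sqrt plus divide stands in for `rsqrtps`.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: normalize four 3D vectors stored as SoA, in place.
   Zero-length inputs come out as [0, 0, 0] instead of NaN.
   Assumption: exact sqrt + divide; rsqrtps would be faster but approximate. */
static void normalize_soa(__m128 *x, __m128 *y, __m128 *z) {
  __m128 sum_of_squares = _mm_add_ps(
      _mm_add_ps(_mm_mul_ps(*x, *x), _mm_mul_ps(*y, *y)),
      _mm_mul_ps(*z, *z));
  /* all-ones where the length is nonzero; independent of the sqrt/divide */
  __m128 nonzero_magnitude = _mm_cmpneq_ps(sum_of_squares, _mm_setzero_ps());
  __m128 inv_len = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(sum_of_squares));
  /* 1/0 gives +Inf; masking turns it into 0.0 so 0 * Inf = NaN never happens */
  inv_len = _mm_and_ps(inv_len, nonzero_magnitude);
  *x = _mm_mul_ps(*x, inv_len);
  *y = _mm_mul_ps(*y, inv_len);
  *z = _mm_mul_ps(*z, inv_len);
}
```

Because the mask is computed from sum_of_squares rather than from the final result, the compare can run in parallel with the square root and the division.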
