
1) Is there a way to efficiently implement a sign function using SSE3 (no SSE4) with the following characteristics?

  • the input is a float vector __m128.
  • the output should also be a __m128, with each element being -1.0f, 0.0f, or 1.0f

I tried this, but it didn't work (though I think it should):

__m128 inputVal = _mm_set_ps(-0.5f, 0.5f, 0.0f, 3.0f);
__m128 comp1 = _mm_cmpgt_ps(_mm_setzero_ps(), inputVal);
__m128 comp2 = _mm_cmpgt_ps(inputVal, _mm_setzero_ps());
comp1 = _mm_castsi128_ps(_mm_castps_si128(comp1)); // note: these casts only reinterpret bits, they are a no-op
comp2 = _mm_castsi128_ps(_mm_castps_si128(comp2));
__m128 signVal = _mm_sub_ps(comp1, comp2);

2) Is there a way to create a "flag" function (I'm not sure about the right name)? Namely, the result should be 1.0f if A > B and 0.0f otherwise. The result should be floating-point (__m128), just like its inputs.

UPDATE: It seems Cory Nelson's answer will work here:

__m128 greaterThanFlag = _mm_and_ps(_mm_cmpgt_ps(valA, valB), _mm_set1_ps(1.0f));
__m128 lessThanFlag    = _mm_and_ps(_mm_cmplt_ps(valA, valB), _mm_set1_ps(1.0f));
plasmacel
Royi
  • For part 2, why would you want a `1.0f` flag, instead of just the AND mask from `_mm_cmpgt_ps`? E.g. to conditionally add something, mask the operand with the compare result to zero elements where the compare is false. All-zero bits = 0.0f, which is the additive identity. Or if you just want to make something 0 or unchanged, AND it with the mask. – Peter Cordes Dec 28 '16 at 07:03
  • @PeterCordes, I'm not sure I understand you (I'm not that experienced with bits :-)). Would you write a solution to part 2? The more efficient, the better :-). – Royi Dec 28 '16 at 07:34
  • If that's what you want, that is the most efficient way to get it. But my point is that producing a `0.0` or `1.0` result is probably not useful in the first place. What do you want to do with it? – Peter Cordes Dec 28 '16 at 07:39
  • 1
    You could look at the code generated by g++ with `(A>B)?1.f:0.f` to get some ideas... (cmpltps+andps) – Marc Glisse Dec 28 '16 at 10:38
  • @PeterCordes, Because then one could easily create "IF interpolation". Something like `a = flagA * a + (1 - flagA) * b`. Something like masking without SSE4. – Royi Jan 01 '17 at 07:10
  • 2
    /facepalm. I was afraid you wanted it for something horrible like that. Using FP multiply and add is a really slow way to implement `condition ? a : b`. Use the usual ANDPS/ANDNPS/ORPS sequence to blend based on a compare result if you don't have SSE4 BLENDPS. See the [Branchless “select” `(cond ? a : b)` section in this blog post](https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/) – Peter Cordes Jan 01 '17 at 07:16
  • 2
    If you want to negate something, turn the mask into a vector with just the sign bits (conditionally) set, and XOR with that to flip the sign bits of another vector. (Remember that XORing with `0` has no effect, it's the identity value. XORing with `1` flips a bit. See also [SSE absolute value](http://stackoverflow.com/questions/32408665/fastest-absolute-value-calculator-using-sse) for more about manipulating sign bits with booleans. – Peter Cordes Jan 01 '17 at 07:22
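
For completeness, here is a minimal sketch of the ANDPS/ANDNPS/ORPS select described in the comments above (untested; the helper name `select_ps` is arbitrary, and `mask` is assumed to be an all-ones/all-zeros compare result such as the output of _mm_cmpgt_ps):

// cond ? a : b without SSE4 BLENDPS: take a where mask bits are set, b elsewhere
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// e.g. blending on A > B: select_ps(_mm_cmpgt_ps(valA, valB), a, b)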

4 Answers


The first that comes to mind is perhaps the simplest:

__m128 sign(__m128 x)
{
    __m128 zero = _mm_setzero_ps();

    // each compare produces an all-ones mask where true; ANDing with
    // 1.0f / -1.0f selects that constant in the matching lanes
    __m128 positive = _mm_and_ps(_mm_cmpgt_ps(x, zero), _mm_set1_ps(1.0f));
    __m128 negative = _mm_and_ps(_mm_cmplt_ps(x, zero), _mm_set1_ps(-1.0f));

    return _mm_or_ps(positive, negative);
}

Or, if you misspoke and intended to get an integer result:

__m128i sign(__m128 x)
{
    __m128 zero = _mm_setzero_ps();

    // the all-ones compare mask is already integer -1, so only the
    // positive lanes need an AND with integer 1
    __m128 positive = _mm_and_ps(_mm_cmpgt_ps(x, zero),
                                 _mm_castsi128_ps(_mm_set1_epi32(1)));
    __m128 negative = _mm_cmplt_ps(x, zero);

    return _mm_castps_si128(_mm_or_ps(positive, negative));
}
Cory Nelson
  • You can make a slightly-sloppy version of this using only one `_mm_cmpneq_ps(x, zero)`, and then copying the sign bit. See my answer for a version that returns `-0.0` when the input is `-0.0`, but otherwise returns the same thing for non-NaN inputs. – Peter Cordes Dec 27 '16 at 23:27
  • As plasmacel's integer version points out, the all-ones compare result is already an integer `-1`. The compiler will hopefully optimize away the `_mm_and_ps` with `set1_epi32(-1)`. And interestingly, you can turn a compare result into an integer 0 or 1 with a logical right shift so you don't need a constant there either. (But that cast is likely to have a bypass-delay). Anyway, you should be using `_mm_and_si128` and `_mm_or_si128` in your integer version. – Peter Cordes Dec 28 '16 at 01:46
  • @PeterCordes the integer version keeps it in the floating-point domain the entire way through to avoid pipeline stalls that can come with re-interpreting a register. – Cory Nelson Dec 28 '16 at 04:29
  • Right, but there's going to be some bypass-delay extra latency somewhere when the result of this function is used as an integer vector. Or not, since some instructions on Intel SnB-family seem to be usable without extra latency on integer or FP data. e.g. I think Agner Fog says boolean ops like `_mm_or_ps` have no extra latency when their output is used as the input to an integer operation. For your FP version, where both compare results go through an ANDPS, it's probably optimal to still use `_mm_and_ps`, but then use `_mm_or_si128` on those results, giving an integer-domain result for free. – Peter Cordes Dec 28 '16 at 04:36
  • Integer-domain boolean ops can run on any port on Intel CPUs, but FP-domain booleans only run on port 5 on pre-Skylake, so integer-domain booleans have better throughput. See [my answer here](http://stackoverflow.com/questions/26942952/difference-between-the-avx-instructions-vxorpd-and-vpxor). Also, hmm, I see that `orps` has no extra latency when used with integer inputs, but I'm not sure about the other way around on SnB-family. `por` might have extra latency when used on the results of `andps`. And I think you're right that `cmpps` produces its 0/-1 result in the FP domain. – Peter Cordes Dec 28 '16 at 04:41
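
For reference, a sketch of the shift trick from the comments above (untested): a compare mask is all-ones, i.e. integer -1, so a logical right shift by 31 turns it into an integer 0/1 without needing a constant:

// 0/1 integer flag for valA > valB, using the mask-shift trick
__m128i flag01 = _mm_srli_epi32(_mm_castps_si128(_mm_cmpgt_ps(valA, valB)), 31);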

If it's ok for sgn(-0.0f) to produce an output of -0.0f instead of +0.0f, you can save an instruction or two compared to @Cory Nelson's version. See below for a version which also propagates NaN.

  • select 0.0 or 1.0 based on a compare for x != 0.0f
  • copy the sign bit of x onto that result.

// return -0.0 for x=-0.0, otherwise the same as Cory's (except for NaN which neither handle well)
__m128 sgn_fast(__m128 x)
{
    __m128 negzero = _mm_set1_ps(-0.0f);

    // using _mm_setzero_ps() here might actually be better without AVX, since xor-zeroing is as cheap as a copy but starts a new dependency chain
    //__m128 nonzero = _mm_cmpneq_ps(x, negzero);  // -0.0 == 0.0 in IEEE floating point
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());

    __m128 x_signbit = _mm_and_ps(x, negzero);

    __m128 zeroone = _mm_and_ps(nonzero, _mm_set1_ps(1.0f));
    return _mm_or_ps(zeroone, x_signbit);
}
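
A quick usage check, borrowing the question's test values (note that _mm_set_ps takes its arguments from high element to low):

float out[4];
_mm_storeu_ps(out, sgn_fast(_mm_set_ps(3.0f, 0.0f, 0.5f, -0.5f)));
// out = { -1.0f, 1.0f, 0.0f, 1.0f }, i.e. sgn(-0.5f), sgn(0.5f), sgn(0.0f), sgn(3.0f)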

When the input is NaN, I think it returns +/-1.0f, according to the sign of the NaN. (Since _mm_cmpneq_ps() is true when x is NaN: see the table on the CMPPD instruction).

Without AVX, this is two fewer instructions than Cory's version (with clang3.9 on the Godbolt compiler explorer). When inlined into a loop, the memory source operands could be register source operands. gcc uses more instructions, doing a separate MOVAPS load and painting itself into a corner that requires an extra MOVAPS to get the return value into xmm0.

    xorps   xmm1, xmm1
    cmpneqps        xmm1, xmm0
    andps   xmm0, xmmword ptr [rip + .LCPI0_0]    # x_signbit
    andps   xmm1, xmmword ptr [rip + .LCPI0_1]    # zeroone
    orps    xmm0, xmm1

The critical-path latency is cmpneqps + andps + orps, which is 3+1+1 cycles on Intel Haswell for example. Cory's version needs to run two cmpps instructions in parallel to achieve that latency, which is only possible on Skylake. Other CPUs will have a resource conflict causing an extra cycle of latency.


To propagate NaN, so the possible outputs would be -1.0f, -/+0.0f, 1.0f, and NaN, we could take advantage of the fact that the all-ones bit pattern is a NaN.

  • _mm_cmpunord_ps(x,x) to get a NaN-mask (or equivalently, _mm_cmpneq_ps(x,x))
  • OR that onto the result to leave it unmodified or force it to NaN.

// return -0.0 for x=-0.0.  Return -NaN for any NaN
__m128 sgn_fast_nanpropagating(__m128 x)
{
    __m128 negzero = _mm_set1_ps(-0.0f);
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());

    __m128 x_signbit = _mm_and_ps(x, negzero);
    __m128 nanmask   = _mm_cmpunord_ps(x,x);
    __m128 x_sign_or_nan = _mm_or_ps(x_signbit, nanmask);   // apply it here instead of to the final result for better ILP

    __m128 zeroone = _mm_and_ps(nonzero, _mm_set1_ps(1.0f));
    return _mm_or_ps(zeroone, x_sign_or_nan);
}

This compiles efficiently, and barely lengthens the critical path latency. It does take more MOVAPS instructions to copy registers without AVX, though.


You might be able to do something useful with SSE4.1 BLENDVPS, but it's not the most efficient instruction on all CPUs. It's also hard to avoid treating negative zero as non-zero.
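
For what it's worth, a possible SSE4.1 sketch (untested, hypothetical name, and not claimed to be faster): BLENDVPS selects by the sign bit of x, and the extra cmpneqps/andps is exactly what it takes to stop -0.0f from turning into -1.0f:

__m128 sgn_blendv(__m128 x)
{
    // -1.0f where the sign bit of x is set, 1.0f elsewhere
    __m128 pm1 = _mm_blendv_ps(_mm_set1_ps(1.0f), _mm_set1_ps(-1.0f), x);
    // zero out the lanes where x == +/-0.0f
    return _mm_and_ps(pm1, _mm_cmpneq_ps(x, _mm_setzero_ps()));
}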


If you want an integer result, you can use SSSE3 _mm_sign_epi32(set1(1), x) to get a -1, 0, or 1 output. If -0.0f -> -1 is too sloppy, you can fix that up by ANDing with the result of _mm_cmpneq_ps(x, _mm_setzero_ps()).

// returns -1 for x = -0.0f
__m128i sgn_verysloppy_int_ssse3(__m128 x) {
  __m128i one = _mm_set1_epi32(1);
  __m128i sign = _mm_sign_epi32(one, _mm_castps_si128(x));
  return sign;
}

// correct results for all inputs
// NaN -> -1 or 1 according to its sign bit, never 0
__m128i sgn_int_ssse3(__m128 x) {
  __m128i one = _mm_set1_epi32(1);
  __m128i sign = _mm_sign_epi32(one, _mm_castps_si128(x));

  __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
  return _mm_and_si128(sign, _mm_castps_si128(nonzero));
}
Peter Cordes
  • I came up with a similar answer for integer results but without SSSE3. – plasmacel Dec 28 '16 at 00:45
  • Great answer! +1. I don't mind `-0.0f` becoming `-0.0f`, but it can't be `-1.0f`. How would you do the second part most efficiently? What if SSE3 is allowed? – Royi Dec 28 '16 at 07:31
  • 1
    I just realized now that for `-0.0f` your solution produces `-0.0f`, not `-1.0f`. If the result will be used in floating-point operations, it's not worse than Cory's solution at all. Even better, it preserves the signbit of all floating-point values, while producing correct mathematical results for the sign. Actually, it is a more IEEE-754 compliant solution, I like it. – plasmacel Dec 29 '16 at 00:09
  • @plasmacel: Neither one propagates NaNs, though. Since the all-ones bit-pattern is a NaN, we could maybe use a `cmpneqps` of x with itself to get a NaN-mask, and OR that onto the result. – Peter Cordes Dec 29 '16 at 00:15
  • @PeterCordes That would destroy the performance gain of this solution. On the other hand, there are several piecewise-defined `std` functions which also don't propagate `NaN`s in the first place (like `std::fmin`/`std::fmax`). Maybe this behavior could be controlled by a conditional preprocessor to support the rare cases when `NaN` propagation is necessary. – plasmacel Dec 29 '16 at 00:30
  • 1
    @plasmacel: See my update. Critical-path latency should barely increase on most CPUs, since the NaN check can happen in parallel. And applying it to `x_signbit` instead of the final result means more ILP. Cory's version doesn't propagate NaN either (I think returning 0.0 for NaN), so this is about the same perf as that but with NaN handling. I did of course leave the higher-perf version that returns -/+1.0f according to the sign of the NaN. – Peter Cordes Dec 29 '16 at 00:38
  • @Royi: None of my `sgn` versions return `-1.0f` for `-0.0f`. Only the verysloppy integer version returns `(int)-1` for `-0.0f`, so don't use that. – Peter Cordes Dec 29 '16 at 00:39
  • @PeterCordes I also suggest removing the "sloppy" diminutive suffix from their names - these are the preferred way of implementing sign with floating-point results. – plasmacel Dec 29 '16 at 02:04
  • @plasmacel: ok, if you say so. I thought 4 possible results was less desirable than 3. Updated to "fast" instead of "sloppy" :P – Peter Cordes Dec 29 '16 at 02:11

If you need a signum function for float vectors, where the result is an int32_t vector, and you don't care about NaNs, then a more efficient version can be implemented using integer instructions, based on the following theory.

If you take a floating-point number and reinterpret its bits as a signed two's complement integer, you get 3 distinct cases (where X is an arbitrary 0 or 1, the leftmost bit is the sign bit, and at least one X is set in the first case; the 32-bit patterns are abbreviated here):

  • 0 X X X X X X X X X X X X X X X, which is > 0 (or > 0.0f as float)
  • 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, which is == 0 (or == 0.0f as float)
  • 1 X X X X X X X X X X X X X X X, which is < 0 (or <= 0.0f as float)

The last case is ambiguous, since it can be the special floating-point case of negative zero, -0.0f:

  • 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, which is == -0.0f == 0.0f as float

From this point the floating-point signum function turns into an integer function.
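
As a scalar illustration of that reinterpretation (a sketch with a hypothetical name, assuming 32-bit int and IEEE-754 float; NaNs map to -1 or 1 by their sign bit, matching the vector versions below):

#include <stdint.h>
#include <string.h>

int32_t signum_scalar(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof(i));     // reinterpret the float's bits
    if (i == INT32_MIN) return 0;  // 0x80000000 is -0.0f: only the sign bit set
    return (i > 0) - (i < 0);      // plain two's complement signum
}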


Using intrinsics available with SSE3 (not SSSE3) this can be implemented as:

inline __m128i _mm_signum_ps(__m128 a)
{
    __m128i x = _mm_castps_si128(a);

    __m128i zero = _mm_setzero_si128();
    __m128i m0 = _mm_cmpgt_epi32(x, zero);
    __m128i m1 = _mm_cmplt_epi32(x, zero);
    __m128i m2 = _mm_cmpeq_epi32(x, _mm_set1_epi32(0x80000000));

    __m128i p = _mm_and_si128(m0, _mm_set1_epi32(+1));

    // Note that since (-1 == 0xFFFFFFFF) in two's complement, ANDing m1
    // with set1_epi32(-1) would be a no-op, so n is simply m1 itself:
    // __m128i n = _mm_and_si128(m1, _mm_set1_epi32(-1));
    __m128i n = m1;

    return _mm_andnot_si128(m2, _mm_or_si128(p, n));
}

The optimized version of this is:

inline __m128i _mm_signum_ps(__m128 a)
{
    __m128i x = _mm_castps_si128(a);

    __m128i zr = _mm_setzero_si128();
    __m128i m0 = _mm_cmpeq_epi32(x, _mm_set1_epi32(0x80000000));
    __m128i mp = _mm_cmpgt_epi32(x, zr);
    __m128i mn = _mm_cmplt_epi32(x, zr);

    return _mm_or_si128(
      _mm_andnot_si128(m0, mn),
      _mm_and_si128(mp, _mm_set1_epi32(1))
    );
}

As Peter suggested in the comments, using one floating-point comparison (_mm_cmplt_ps) instead of two integer comparisons (_mm_cmplt_epi32/_mm_cmpeq_epi32) to handle -0.0f saves one cycle of latency, but it might suffer from bypass-delay latency when switching between the floating-point and integer domains, so it's maybe better to stick with the integer-only implementation above. Or not: since you need an integer result, it's likely that you will use it and swap to the integer domain anyway. So:

inline __m128i _mm_signum_ps(__m128 a)
{
    __m128i x = _mm_castps_si128(a);
    __m128 zerops = _mm_setzero_ps();

    __m128i mn = _mm_castps_si128(_mm_cmplt_ps(a, zerops));
    __m128i mp = _mm_cmpgt_epi32(x, _mm_castps_si128(zerops));

    return _mm_or_si128(mn, _mm_and_si128(mp, _mm_set1_epi32(1)));
}

With -march=x86-64 -msse3 -O3 in clang 3.9 this compiles to:

_mm_signum_ps(float __vector(4)):                # @_mm_signum_ps(float __vector(4))
        xorps   xmm1, xmm1                       # fp domain
        movaps  xmm2, xmm0                       # fp domain
        cmpltps xmm2, xmm1                       # fp domain
        pcmpgtd xmm0, xmm1                       # int domain
        psrld   xmm0, 31                         # int domain
        por     xmm0, xmm2                       # int domain
        ret

Except for cmpltps, every instruction here has a latency of 1 and a reciprocal throughput of at most 1. I think it is a really efficient solution, and it can be further improved with SSSE3's _mm_sign_epi32.
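
That SSSE3 improvement would look something like the following sketch (untested; essentially the same as Peter's sgn_int_ssse3 above, with a hypothetical name in this answer's style):

inline __m128i _mm_signum_ps_ssse3(__m128 a)
{
    // psignd copies the sign of each lane's bit pattern onto integer 1;
    // the cmpneqps mask then zeroes the +/-0.0f lanes
    __m128i sign    = _mm_sign_epi32(_mm_set1_epi32(1), _mm_castps_si128(a));
    __m128i nonzero = _mm_castps_si128(_mm_cmpneq_ps(a, _mm_setzero_ps()));
    return _mm_and_si128(sign, nonzero);
}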


If you need floating-point results, it is better to stay entirely in the floating-point domain (instead of swapping between the floating-point and integer domains), so use one of Peter's solutions.

Peter Cordes
plasmacel
  • I think if you want an integer result without SSSE3 `psignd`, it might be faster to still use cmpps instead of having to special-case `-0.0`. Using shifts instead of (or after) compares is a neat trick, though, and saves some integer constants. – Peter Cordes Dec 28 '16 at 01:34
  • No idea what you're talking about with wishing the compiler used `pcmpgtd` instructions to compare against zero. On Haswell, `psrad` runs on p0, while `pcmpgtd` runs on p15. Can you explain why you think there's any lost instruction-level-parallelism here? I didn't look super-closely, so maybe there's something, in which case you should explain it in your answer. – Peter Cordes Dec 28 '16 at 01:40
  • @PeterCordes No, you are right. I incorrectly thought that they run on the same ports. At the comparison stage I focused on executing 2x `pcmpgtd` in parallel on the same port, leaving the other ports free. In this case both compiled versions are equivalent in performance, as long as they don't take away p0 from another instruction. – plasmacel Dec 28 '16 at 02:45
  • @PeterCordes Btw I tried `cmpltps` instead of the two integer comparisons to handle the special case `-0.0f`. It saves 3 instructions and introduces 2 extra cycles of latency, so overall it saves 1 cycle of latency, yeah. – plasmacel Dec 28 '16 at 02:53
  • Ah, yeah I wondered if you were getting lower latency by using only integer compares, even though it takes more instructions. Neat. As long as bypass-delays don't bite you and eat up all the gains... – Peter Cordes Dec 28 '16 at 03:00
  • @PeterCordes Yeah, if I have some time to waste I will do some tests which one performs better in a real world scenario. – plasmacel Dec 28 '16 at 03:06
  • There are many different possible real-world scenarios... Some of them favour latency, some of them favour fused-domain uop count, and others bottleneck on a certain port so reducing pressure on that port is a win. And then there's the question of which operands are in registers, and register pressure for holding constants... It's generally not useful to try to invent a "real-world microbenchmark" because that's an oxymoron. IDK, it might be neat to try to see how they compare when the surrounding code and use-case bottlenecks on different things, but it will be time-consuming to do well. – Peter Cordes Dec 28 '16 at 03:19
  • @PeterCordes Actually, as you also pointed out, once you need an integer result it's likely that you will use it and swap to the integer domain anyway. In this sense the mixed-domain implementation with `cmpltps` is a better choice (or at least not worse). Otherwise the integer-only implementation is the better one. – plasmacel Dec 28 '16 at 06:58
  • Great answer, +1. I use it for a flag; namely, the result will be multiplied by a float SSE vector, so should I have the result as float? How would you handle the second part? Thank you. – Royi Dec 28 '16 at 07:26
  • @Royi If you multiply the result with a float (`__m128`) vector, then yes, you need a floating-point result, and in that case I suggest using Cory's 1st solution `__m128 sign(__m128 x)`. Peter already made a comment on your question asking for the reason behind your second question. It's hard to say anything clever before you explain what you want to do with that "flag" vector. – plasmacel Dec 28 '16 at 07:35
  • @plasmacel, Thank you for your great comments. I added my reasoning for part II. Thank You. – Royi Jan 01 '17 at 07:12
  • It strikes me that the line `__m128i n = _mm_and_si128(m1, _mm_set1_epi32(-1));` in the first code snippet is redundant, because `-1 = 0xffff_ffff`, so doing a bitwise `AND` with any other number yields that number. Just for posterity, in case somebody stumbles over the same thing. – mSSM Nov 21 '18 at 11:36
  • @mSSM Good catch. However, there is an optimized version below the snippet you mention. The first version shows the idea, the second one is optimized for production. I will add a comment to that line though. – plasmacel Nov 21 '18 at 17:47

You're close, but your code doesn't quite work because you are trying to convert a 0/-1 int to float using only a cast, which just reinterprets the bits rather than converting the value.

Try this (untested):

__m128 inputVal = _mm_set_ps(-0.5f, 0.5f, 0.0f, 3.0f);
__m128 comp1 = _mm_cmpgt_ps(_mm_setzero_ps(), inputVal);
__m128 comp2 = _mm_cmpgt_ps(inputVal, _mm_setzero_ps());
comp1 = _mm_cvtepi32_ps(_mm_castps_si128(comp1)); // 0/-1 => 0.0f/-1.0f
comp2 = _mm_cvtepi32_ps(_mm_castps_si128(comp2));
__m128 signVal = _mm_sub_ps(comp1, comp2);        // (-1) - 0 = -1, 0 - (-1) = +1

Having said that, I think Cory's solution is probably more efficient.

Paul R
  • It says something about "No suitable user-defined conversion from '__m128' to '__m128i' exists". – Royi Dec 24 '16 at 17:43
  • Ah - some compilers will need a cast (which compiler are you using, BTW?) - I'll fix it shortly. – Paul R Dec 24 '16 at 17:45
  • I'm using VS 2015. – Royi Dec 24 '16 at 17:48
  • OK - Visual Studio is always more awkward when writing SSE code - I've added casts now and it should work. – Paul R Dec 24 '16 at 17:52
  • Paul R, Thank You +1. – Royi Dec 24 '16 at 20:46
  • What compiler did not require `_mm_castps_si128`? – Z boson Dec 27 '16 at 17:28
  • 1
    @Zboson: I think it's earlier versions of gcc (up to 4.something), or more recent versions of gcc with `-flax-vector-conversions`. – Paul R Dec 27 '16 at 20:26
  • Those casts can be very annoying. I updated my [8x8 transpose answer today](http://stackoverflow.com/a/25627536/2542702). I have lines like `r0 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+0])), _mm_load_ps(&mat[4*8+0]), 1);` Having to cast `(__m256i*)` is also annoying. They fixed that with AVX512, but not totally, because if you want to e.g. insert `__m256i` with AVX512 you still have to cast `(__m256i*)`. Intrinsics can be obnoxiously long and overly verbose just to make the compiler happy. – Z boson Dec 27 '16 at 20:59
  • 1
    @Zboson: yes, I tend to isolate a lot of this stuff into custom macros or inline functions, otherwise it tends to get long-winded and unreadable, – Paul R Dec 27 '16 at 22:36