Is there any single instruction or function that can invert the sign of every float inside a __m128? i.e. a = r0:r1:r2:r3 ===> a = -r0:-r1:-r2:-r3?

I know this can be done by _mm_sub_ps(_mm_set1_ps(0.0),a), but isn't it potentially slow since _mm_set1_ps(0.0) is a multi-instruction function?

Antonio
Bob Fang
  • 2
    Possible duplicate of [Flipping sign on packed SSE floats](http://stackoverflow.com/questions/3361132/flipping-sign-on-packed-sse-floats) – Antonio Mar 11 '16 at 13:20

1 Answer

In practice your compiler should do a good job of generating the constant vector for 0.0. It will probably just zero a register with a single xorps instruction, and if your code is in a loop it should hoist the constant generation out of the loop anyway. So, bottom line, use your original idea of:

v = _mm_sub_ps(_mm_set1_ps(0.0), v);

or another common trick, which is:

v = _mm_xor_ps(v, _mm_set1_ps(-0.0));

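which just flips the sign bits instead of doing a subtraction. Being a pure bitwise operation, it treats every input the same way: +0.0 becomes -0.0 (the subtraction yields +0.0 for a +0.0 input), and a NaN stays a NaN with only its sign bit flipped, as the comments below point out. It may also be more efficient in some cases.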
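A minimal sketch comparing the two approaches, assuming an x86 target with SSE enabled and default floating-point settings (round-to-nearest, no fast-math flags). The names `negate_sub`, `negate_xor`, `lane_bits` and `check_negate` are hypothetical helpers introduced only for illustration. The self-check shows that both methods agree on ordinary values but differ on a +0.0 input:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_sub_ps, _mm_xor_ps */

/* Negate by subtracting from zero (the arithmetic approach). */
static __m128 negate_sub(__m128 v)
{
    return _mm_sub_ps(_mm_setzero_ps(), v);
}

/* Negate by flipping the sign bit of each lane (the bitwise approach). */
static __m128 negate_xor(__m128 v)
{
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}

/* Hypothetical helper: raw bit pattern of lane i, for inspecting signs. */
static uint32_t lane_bits(__m128 v, int i)
{
    float f[4];
    uint32_t u;
    _mm_storeu_ps(f, v);
    memcpy(&u, &f[i], sizeof u);
    return u;
}

/* Small self-check of the two approaches. */
static void check_negate(void)
{
    __m128 v = _mm_setr_ps(1.5f, -2.0f, 0.0f, 3.0f);
    __m128 a = negate_sub(v);
    __m128 b = negate_xor(v);

    /* Ordinary values: both methods produce the same bits. */
    assert(lane_bits(a, 0) == lane_bits(b, 0));   /* -1.5f */
    assert(lane_bits(a, 1) == lane_bits(b, 1));   /* +2.0f */

    /* +0.0 input: subtraction gives +0.0, XOR gives -0.0. */
    assert(lane_bits(a, 2) == 0x00000000u);
    assert(lane_bits(b, 2) == 0x80000000u);
}
```

Calling `check_negate()` from a test or from `main` exercises both variants. Whether the zero-sign difference matters depends on what the surrounding code does with the result.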

Paul R
  • 9
    I'm pretty sure `xor` is fine with NaN. The sign-bit in a NaN has don't-care status, so all NaNs stay as NaNs, and all non-NaNs stay non-NaN. Quiet vs. signalling NaN is indicated by the highest bit of the mantissa (not the highest bit of the whole float). Using `xor` is usually best. On AMD, where `xorps` runs in the integer domain and thus has a bypass delay to/from FP math instructions, it's still about the same latency as 5c `subps`. – Peter Cordes Mar 11 '16 at 13:20
  • 3
    Loading the `-0.0` constant from memory could cache-miss, though. Compilers don't like to [generate constants on the fly](http://stackoverflow.com/questions/35085059/what-are-the-best-instruction-sequences-to-generate-vector-constants-on-the-fly) if it takes more than one insn (`xorps same,same` or `pcmpeqw same,same` (all-ones)). This one just takes [`pcmpeqw xmm7,xmm7` / `pslld xmm7, 31`](http://stackoverflow.com/a/32422471/224132) (see that link for SSE absolute value: ANDN with that mask, or AND with its inverse) – Peter Cordes Mar 11 '16 at 13:25
  • Can't `_mm_set1_ps(0.0)` be replaced with `_mm_setzero_ps()`? – Marcin Poloczek Sep 21 '21 at 03:01
  • @MarcinPoliczek: yes, use whichever you prefer - the same code will be generated by the compiler in either case. – Paul R Sep 21 '21 at 07:12
  • Is there any way to handle 0x80000000 (i.e. -2147483648)? I am expecting 2147483647 as the result. – Ram Kiran Dec 02 '21 at 07:24
  • Stick with `_mm_setzero_ps()` - it's really a 'synthetic' intrinsic that doesn't have an exact correspondence to a particular instruction. That means it conveys a *semantic* requirement in a modern compiler, and can often lead to more efficient code based on local code generation. – Brett Hale Mar 13 '22 at 10:10
  • @BrettHale: the [same code will be generated in either case](https://godbolt.org/z/eExWhch6f), so it’s more a matter of style or personal preference. – Paul R Mar 13 '22 at 11:16
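
The on-the-fly constant generation mentioned in the comments (`pcmpeqw` followed by `pslld` by 31) can be sketched with SSE2 intrinsics as follows. This is only an illustration: `sign_mask_on_the_fly`, `negate_with_mask` and `check_mask` are hypothetical names, and a compiler is free to constant-fold the mask back into a load from memory, so the intrinsics express intent rather than guarantee a particular instruction sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: integer ops on __m128i, cast intrinsics */

/* Build the 0x80000000-per-lane mask without loading it from memory:
   comparing a register with itself yields all-ones, and shifting each
   32-bit lane left by 31 leaves only the sign bit set. */
static __m128 sign_mask_on_the_fly(void)
{
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi16(zero, zero);  /* pcmpeqw */
    __m128i mask = _mm_slli_epi32(ones, 31);     /* pslld 31 */
    return _mm_castsi128_ps(mask);
}

/* Flip the sign bit of every lane using the generated mask. */
static __m128 negate_with_mask(__m128 v)
{
    return _mm_xor_ps(v, sign_mask_on_the_fly());
}

/* Small self-check. */
static void check_mask(void)
{
    float out[4];
    uint32_t bits;

    _mm_storeu_ps(out, negate_with_mask(_mm_setr_ps(1.0f, -1.0f, 0.0f, 8.0f)));
    assert(out[0] == -1.0f && out[1] == 1.0f && out[3] == -8.0f);

    memcpy(&bits, &out[2], sizeof bits);
    assert(bits == 0x80000000u);  /* +0.0 became -0.0 */
}
```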