How to negate (change the sign of) the floating-point elements in a __m128 variable?

Question

Is there a single instruction or intrinsic that flips the sign of every float in a __m128, i.e. a = r0:r1:r2:r3 ===> a = -r0:-r1:-r2:-r3?

I know this can be done with _mm_sub_ps(_mm_set1_ps(0.0), a), but isn't that potentially slow, since _mm_set1_ps(0.0) may expand to more than one instruction?

Possible duplicate of [Flipping sign on packed SSE floats](http://stackoverflow.com/questions/3361132/flipping-sign-on-packed-sse-floats) — Antonio, Mar 11 '16 at 13:20

score 31 · Accepted Answer · answered Nov 19 '13 at 23:03

In practice your compiler should do a good job of generating the constant vector for 0.0. It will probably just zero a register with a single xorps (the instruction behind _mm_xor_ps), and if your code is in a loop it should hoist the constant generation out of the loop anyway. So, bottom line, use your original idea of:

v = _mm_sub_ps(_mm_set1_ps(0.0), v);

or another common trick, which is:

v = _mm_xor_ps(v, _mm_set1_ps(-0.0));

which just flips the sign bits instead of doing a subtraction (not quite as safe as the first method, since it doesn't do the right thing with NaNs, but may be more efficient in some cases).
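For comparison, here is a minimal sketch of both approaches as small C helpers. The names negate_sub, negate_xor and lane_bits are made up for illustration and are not part of any library. One difference that shows up when you look at the raw bits: under the default rounding mode, 0.0 - (+0.0) is +0.0, so the subtraction version leaves a positive zero positive, while the XOR version turns it into -0.0.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_sub_ps, _mm_xor_ps */

/* Negate by subtracting from zero: computes 0.0f - x in each lane. */
static inline __m128 negate_sub(__m128 v) {
    return _mm_sub_ps(_mm_setzero_ps(), v);
}

/* Negate by flipping the sign bit: -0.0f has only bit 31 set. */
static inline __m128 negate_xor(__m128 v) {
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}

/* Hypothetical helper: raw bit pattern of lane i (0..3), for inspection. */
static inline uint32_t lane_bits(__m128 v, int i) {
    float f[4];
    uint32_t u;
    _mm_storeu_ps(f, v);            /* unaligned store of all four lanes */
    memcpy(&u, &f[i], sizeof u);    /* type-pun without aliasing issues */
    return u;
}
```

For ordinary non-zero finite inputs the two helpers should produce identical bits; they only disagree on inputs such as +0.0 (and on the sign of a NaN result).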

Paul R

I'm pretty sure `xor` is fine with NaN. The sign-bit in a NaN has don't-care status, so all NaNs stay as NaNs, and all non-NaNs stay non-NaN. Quiet vs. signalling NaN is indicated by the highest bit of the mantissa (not the highest bit of the whole float). Using `xor` is usually best. On AMD, where `xorps` runs in the integer domain and thus has a bypass delay to/from FP math instructions, it's still about the same latency as 5c `subps`. – Peter Cordes Mar 11 '16 at 13:20
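A small sketch to check the NaN claim at the bit level, assuming SSE2 is available. The helpers xor_negate_bits and is_nan_bits are hypothetical names introduced here. The code works on raw 32-bit patterns, so no float arithmetic touches the values on the way in or out.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: integer <-> float casts, _mm_set1_epi32 */

/* XOR-negate a float given as a raw bit pattern; returns the raw result bits.
   (The uint32_t -> int conversion relies on the usual two's-complement
   behaviour of GCC/Clang/MSVC.) */
static inline uint32_t xor_negate_bits(uint32_t bits) {
    __m128 v = _mm_castsi128_ps(_mm_set1_epi32((int)bits));
    __m128 n = _mm_xor_ps(v, _mm_set1_ps(-0.0f));   /* flips bit 31 only */
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(n));
}

/* IEEE 754 binary32: NaN means exponent all ones and a non-zero mantissa. */
static inline int is_nan_bits(uint32_t bits) {
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
}
```

Since only bit 31 changes, the exponent and mantissa (including bit 22, the quiet/signalling bit) come out untouched, so a NaN input stays the same kind of NaN.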

Loading the `-0.0` constant from memory could cache-miss, though. Compilers don't like to [generate constants on the fly](http://stackoverflow.com/questions/35085059/what-are-the-best-instruction-sequences-to-generate-vector-constants-on-the-fly) if it takes more than one insn (`xorps same,same` or `pcmpeqw same,same` (all-ones)). This one just takes [`pcmpeqw xmm7,xmm7` / `pslld xmm7, 31`](http://stackoverflow.com/a/32422471/224132) (see that link for SSE absolute value: ANDN with that mask, or AND with its inverse) – Peter Cordes Mar 11 '16 at 13:25
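The `pcmpeqw` / `pslld` idea can be sketched with intrinsics roughly as follows, assuming SSE2. The function names are hypothetical. Keep in mind that intrinsics only describe the computation: a compiler is free to constant-fold this back into a load from memory, so inspect the generated assembly if the exact instruction sequence matters.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi16, _mm_slli_epi32 */

/* Build the sign-bit mask (0x80000000 in each lane) without a memory constant. */
static inline __m128 sign_mask_on_the_fly(void) {
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi16(zero, zero);   /* pcmpeqw: every bit set */
    __m128i mask = _mm_slli_epi32(ones, 31);      /* pslld 31: keep only bit 31 */
    return _mm_castsi128_ps(mask);
}

/* Negate all four lanes by XORing with the generated mask. */
static inline __m128 negate_generated_mask(__m128 v) {
    return _mm_xor_ps(v, sign_mask_on_the_fly());
}
```

The same mask also serves the absolute-value trick mentioned above: `_mm_andnot_ps(mask, v)` clears the sign bit in each lane.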
Can't `_mm_set1_ps(0.0)` be replaced with `_mm_setzero_ps()`? – Marcin Poloczek Sep 21 '21 at 03:01
@MarcinPoliczek: yes, use whichever you prefer - the same code will be generated by the compiler in either case. – Paul R Sep 21 '21 at 07:12
Is there any way to handle 0x80000000 (i.e. -2147483648)? I am expecting 2147483647 as the result. – Ram Kiran Dec 02 '21 at 07:24
Stick with `_mm_setzero_ps()` - it's really a 'synthetic' intrinsic that doesn't have an exact correspondence to a particular instruction. That means it conveys a *semantic* requirement in a modern compiler, and can often lead to more efficient code based on local code generation. – Brett Hale Mar 13 '22 at 10:10
@BrettHale: the [same code will be generated in either case](https://godbolt.org/z/eExWhch6f), so it’s more a matter of style or personal preference. – Paul R Mar 13 '22 at 11:16
