
I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a value to be opposite or equal to that of another value. In the serial version, I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the STL uses a built-in function for that, so I can't just copy the code and turn it into SSE/AVX instructions.

I have not tried it yet (so I have no code to show for now), but my simple approach would be to create a register with all values set to -0.0 so that only the sign bit is set. Then I would use an AND operation on the source to find out whether its sign bit is set. The result of this operation would be either 0.0 or -0.0, depending on the sign of the source. From the result, I would create a bitmask (using logic operations) which I can combine with the target register (using another logic operation) to set the sign accordingly.
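
Expressed in scalar terms, the idea would be roughly this (an untested sketch; the memcpy type punning and the CopySignScalar name are just for illustration):

#include <cstdint>
#include <cstring>

float CopySignScalar(float srcSign, float srcValue)
{
    std::uint32_t signBits, valueBits;
    std::memcpy(&signBits, &srcSign, sizeof(signBits));
    std::memcpy(&valueBits, &srcValue, sizeof(valueBits));

    // Two ANDs and an OR: sign bit of srcSign, magnitude bits of srcValue
    std::uint32_t resultBits = (signBits & 0x80000000u) | (valueBits & 0x7FFFFFFFu);

    float result;
    std::memcpy(&result, &resultBits, sizeof(result));
    return result;
}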

However, I am not sure if there isn't a smarter way to solve this. If there is a built-in function for fundamental data types like floats and doubles, maybe there is also an intrinsic that I missed. Any suggestions?

Thanks in advance

EDIT:

Thanks to "chtz" for this useful link:

https://godbolt.org/z/oY0f7c

So basically std::copysign compiles to a sequence of 2 AND operations and a subsequent OR. I will reproduce this for SSE/AVX and post the result here in case somebody else needs it some day :)

EDIT 2:

Here is my working version:

__m128 CopySign(__m128 srcSign, __m128 srcValue)
{
    // Extract the sign bit from srcSign
    const __m128 mask0 = _mm_set1_ps(-0.f);
    __m128 tmp0 = _mm_and_ps(srcSign, mask0);

    // Extract the magnitude bits of srcValue (abs(srcValue))
    __m128 tmp1 = _mm_andnot_ps(mask0, srcValue);

    // Merge the sign bit with the magnitude and return
    return _mm_or_ps(tmp0, tmp1);
}
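
The same pattern should carry over to double precision with the `_pd` intrinsics (an untested sketch along the same lines):

__m128d CopySign(__m128d srcSign, __m128d srcValue)
{
    // Mask with only the sign bit set in each 64-bit lane
    const __m128d mask0 = _mm_set1_pd(-0.);

    // Sign bit of srcSign, magnitude bits of srcValue
    __m128d tmp0 = _mm_and_pd(srcSign, mask0);
    __m128d tmp1 = _mm_andnot_pd(mask0, srcValue);

    return _mm_or_pd(tmp0, tmp1);
}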

Tested it with:

__m128 a = _mm_setr_ps(1, -1, -1, 1);
__m128 b = _mm_setr_ps(-5, -11, 3, 4);

__m128 c = CopySign(a, b);

for (U32 i = 0; i < 4; ++i)
    std::cout << simd::GetValue(c, i) << std::endl;

The output is as expected:

5
-11
-3
4

However, I also tried the version from the disassembly where

__m128 tmp1 = _mm_andnot_ps(mask0, srcValue);

is replaced with:

const __m128 mask1 = _mm_set1_ps(NAN);
__m128 tmp1 = _mm_and_ps(srcValue, mask1);

The results are quite strange:

4
-8
-3
4

Depending on the chosen numbers, the result is sometimes okay and sometimes not. The sign is always correct. It seems like NAN is not ~(-0.0) (the bitwise complement of -0.0) for some reason. I remember that I had some issues before when I tried to set register values to NaN or specific bit patterns. Maybe somebody has an idea about the origin of the problem?

EDIT 3:

As 'Maxim Egorushkin' clarified in the comments of his answer, my expectation that NaN is the bitwise complement of -0.0 was wrong. NaN is not a unique bit pattern (see https://steve.hollasch.net/cgindex/coding/ieeefloat.html).
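
A quick scalar check makes this visible (a sketch using memcpy for type punning; the exact bit pattern of the NAN macro is implementation-defined, typically 0x7FC00000):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float nanVal = NAN;   // usually the quiet NaN 0x7FC00000, but not guaranteed
    float negZero = -0.f; // always 0x80000000

    std::uint32_t nanBits, negZeroBits;
    std::memcpy(&nanBits, &nanVal, sizeof(nanBits));
    std::memcpy(&negZeroBits, &negZero, sizeof(negZeroBits));

    // Typically prints 7fc00000 vs 7fffffff: the NaN mask is missing the
    // low mantissa bits, so AND-ing a value with it truncates the mantissa.
    std::printf("NAN      : %08x\n", nanBits);
    std::printf("~(-0.0f) : %08x\n", ~negZeroBits);
    return 0;
}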

Thank you very much to all of you!

wychmaster
  • Welcome to Stack Overflow; you should try it first. As a hint for future questions, please read https://stackoverflow.com/help/how-to-ask. And please do not assume someone will write code for you. – Zaiborg Sep 10 '19 at 12:40
  • @Zaiborg The OP was asking about the wisdom of their proposed approach, not how to code it or whether it would work or not. – Sneftel Sep 10 '19 at 12:48
  • Unless you have any a priori knowledge about any of your inputs, you won't get better than doing one `andnps`, `andps` and `orps` each. If you know that your `x` is always positive, you can save the `andnps` operation. If you already have a register with `abs(x)` and one with `-abs(x)`, you could do a single `blendvps` (but I think this is only worth it for Zen) – chtz Sep 10 '19 at 12:54
  • I do not expect somebody to write code for me. Sorry if it sounds that way. It is more that I think there might be a much simpler way (existing intrinsics, a single logic operation) than what I described. I will start writing the code in a couple of hours and edit it into my post. However, it doesn't make sense to write code if somebody tells me 'just use _mm__ps for that' or 'just use XOR with xyz to get what you want'. – wychmaster Sep 10 '19 at 12:58
  • Btw: You should also always check, what your compiler already does for you: https://godbolt.org/z/oY0f7c – chtz Sep 10 '19 at 13:04
  • @chtz: Unfortunately, there is no a priori knowledge about the inputs. So as I feared, I have to use multiple logic operations. I will write the function and edit it into the original post as soon as it is done. Maybe you guys see some potential for optimizations. However, if it is really just 3 operations, it would probably take only 3 cycles, which isn't too bad. – wychmaster Sep 10 '19 at 13:08
  • Can't use the blend instructions for this, they're not bit-granular; they select whole floats at once (or bytes at the narrowest). So it's back to the old and/and(n)/or sequence – harold Sep 10 '19 at 13:11
  • On all Intel CPUs (I know of), bit-operations work on any port of `P015`, so you should get a throughput of about 1 cycle (latency is another question). – chtz Sep 10 '19 at 13:13
  • @chtz: Wow, thanks for the link. Of course, makes total sense to take a look at what the compiler does. – wychmaster Sep 10 '19 at 13:16
  • If you `__restrict` the output (or the compiler knows itself that input and output do not overlap), clang also happily auto-vectorizes this for you (gcc apparently does not in this case): https://godbolt.org/z/NH7oMF – chtz Sep 10 '19 at 13:28
  • `NaN` is any sign bit, followed by the maximum exponent (all ones), followed by non-zero mantissa (infinity is zero mantissa). I.e. (non-existing bitwise) `~-0.` is a `NaN`, however a NaN is not necessarily (non-existing bitwise) `~-0.`. – Maxim Egorushkin Sep 10 '19 at 15:43
  • https://en.m.wikipedia.org/wiki/NaN says: "a bit-wise IEEE floating-point standard single precision (32-bit) NaN would be: s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx where s is the sign (most often ignored in applications) and the x sequence represents a non-zero number (the value zero encodes infinities). The first bit from x is used to determine the type of NaN: "quiet NaN" or "signaling NaN". The remaining bits encode a payload (most often ignored in applications)." – Maxim Egorushkin Sep 10 '19 at 21:41
  • @Maxim Egorushkin: Very interesting. Thanks again for the additional information. – wychmaster Sep 10 '19 at 22:38
  • @chtz: Actually, I was wrong about the a priori information. Since the value of interest is the square root of a square sum, my x is always positive and I can drop the _mm_andnot_ps. Maybe I will use a template parameter to choose whether the _mm_andnot_ps should be used or not. I wouldn't have seen this possible micro-optimization if you hadn't mentioned it. – wychmaster Sep 10 '19 at 22:46
  • with `-march=skylake-avx512` it generates just 1 instruction: `vpternlogd` – bolov Jul 12 '20 at 23:32

2 Answers


AVX versions for float and double:

#include <immintrin.h>

__m256 copysign_ps(__m256 from, __m256 to) {
    constexpr float signbit = -0.f;
    auto const avx_signbit = _mm256_broadcast_ss(&signbit);
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

__m256d copysign_pd(__m256d from, __m256d to) {
    constexpr double signbit = -0.;
    auto const avx_signbit = _mm256_broadcast_sd(&signbit);
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

assembly

The Intel Intrinsics Guide


With AVX2, avx_signbit can be generated with no constants:

__m256 copysign2_ps(__m256 from, __m256 to) {
    auto a = _mm256_castps_si256(from);
    auto avx_signbit = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cmpeq_epi32(a, a), 31));
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

__m256d copysign2_pd(__m256d from, __m256d to) {
    auto a = _mm256_castpd_si256(from);
    auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

Still though, both clang and gcc calculate avx_signbit at compile time and replace it with constants loaded from the .rodata section, which is, IMO, sub-optimal.

Maxim Egorushkin
  • Is it bad or unreliable to use `-0.0` instead of that integer constant and the `memcpy`? – harold Sep 10 '19 at 14:24
  • @harold You are right, `-0.` generates exactly the same code. I was hoping for the compiler to generate `movss` and `movsd` instructions without using constants in the data section; however, the compiler generated those `avx_signbit` constants in `.data` anyway. – Maxim Egorushkin Sep 10 '19 at 14:27
  • Thanks for your answer. That is basically what I added to my original post. I just use _mm_set1_ps instead of broadcast. Is there any reason to favour one over the other? And maybe you could have a look at the problem with 'NaN' that I added to my question. – wychmaster Sep 10 '19 at 14:45
  • @Maxim Egorushkin - Thank you for the clarification. That explains why I get incorrect results with the second version. – wychmaster Sep 10 '19 at 14:54
  • With AVX512 you can use `vpternlogd` as a bit-blend to copy just the sign bits in one instruction. (Still need a `set1(-0.0)` mask though). – Peter Cordes Sep 17 '19 at 06:23
  • @wychmaster: `_mm_broadcast_ss()` gets GCC to compress the constant, vs. foolishly repeating the same thing for 32 bytes. https://godbolt.org/z/8HN7kl. Clang already does this optimization for `set1_ps()`. But clang shoots itself in the foot by transforming `andnot` into `and` and inventing a 2nd constant, and compressing them to scalar means neither of them are used as a memory-source operand, instead loaded separately with broadcast loads. https://godbolt.org/z/nqS7Hu – Peter Cordes Sep 17 '19 at 06:28
  • @PeterCordes Yep, IMO, those extra constants in `.data` are a pessimization. – Maxim Egorushkin Sep 17 '19 at 09:41
  • They're in `.rodata` of course, but yes two loads are clearly worse than one in this case, especially with AVX for non-destructive operations. – Peter Cordes Sep 17 '19 at 10:19
  • @PeterCordes Added AVX2 versions, the compilers are still being too smart. – Maxim Egorushkin Sep 17 '19 at 10:56
  • Usually you'd use this inside a loop where the compiler could hoist the loads anyway. One broadcast load from `.rodata` is probably good, and very likely to hit in at least L3 cache, or L2 if it's used often enough to make much perf difference. – Peter Cordes Sep 17 '19 at 11:04
  • @PeterCordes Thank you for the explanation. However, I have to admit that I am not so familiar with assembly. Even though I understand that an additional instruction (clang - broadcast) reduces performance, I am not really sure what the problem with GCC is. I see that the constant for `_mm_set1_ps` has 4 values instead of 1. Does that mean that all values are loaded one after the other, or what is the problem here? And where do I find, or what is, `.data` and `.rodata`? – wychmaster Sep 19 '19 at 09:46

Here's a version that I think is slightly better than the accepted answer if you target icc:

__m256d copysign_pd(__m256d from, __m256d to) {
    __m256d const avx_signbit = _mm256_set1_pd(-0.);
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to));
}

It uses _mm256_set1_pd rather than a broadcast intrinsic. On clang and gcc this is mostly a wash, but on icc the broadcast version actually writes a constant to the stack and then broadcasts from it, which is ... terrible.

Godbolt showing AVX-512 code; adjust the -march= to -march=skylake to see AVX2 code.

Here's an untested AVX-512 version which uses vpternlogq directly, which compiles down to a single vpternlog instruction on icc and clang (gcc includes a separate broadcast):

__m512d copysign_pd_alt(__m512d from, __m512d to) {
    const __m512i sigbit = _mm512_castpd_si512(_mm512_set1_pd(-0.));
    return _mm512_castsi512_pd(_mm512_ternarylogic_epi64(_mm512_castpd_si512(from), _mm512_castpd_si512(to), sigbit, 0xE4));
}

You could make a 256-bit version of this for when AVX-512 is enabled but you're dealing with __m256* vectors.
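
For example, something along these lines (an untested sketch; `_mm256_ternarylogic_epi64` needs AVX-512VL in addition to AVX-512F, and the function name is just illustrative):

__m256d copysign_pd_vl(__m256d from, __m256d to) {
    const __m256i sigbit = _mm256_castpd_si256(_mm256_set1_pd(-0.));
    // Same 0xE4 truth table as above: take the sign bit from 'from',
    // all other bits from 'to'
    return _mm256_castsi256_pd(_mm256_ternarylogic_epi64(
        _mm256_castpd_si256(from), _mm256_castpd_si256(to), sigbit, 0xE4));
}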

Peter Cordes
BeeOnRope
  • With AVX-512, I think `vpternlogd` can do this in one instruction, using the same mask constant. Oh, some compilers already optimize your code to that, but not all. Perhaps worth doing that explicitly for an AVX512 version. – Peter Cordes Jul 12 '20 at 21:21
  • 1
    @PeterCordes - yes, clang gets it down to 1 ternlog, icc gets close (but only takes the simple approach of compressing 2 bitops to 1 ternlog, leaving one bit op still there), gcc fails. – BeeOnRope Jul 12 '20 at 21:30
  • @PeterCordes - added the AVX-512 version, but it took me 15 frigging minutes to look up the magic constants needed to calculate the `vpternlogq` immediate. Oddly, clang needed the `epi64` version of ternlog, not the `epi32` one, to generate a broadcast. Without a kmask, those are identical operations, no? Neither gcc nor icc required that (gcc seems to create a 128-bit constant but then only broadcast from 64 bits of it). – BeeOnRope Jul 12 '20 at 23:29
  • Yes, they're identical except for the broadcast element size. Compilers are dumb :/ (or their internals were designed before broadcast memory constants were possible, and haven't yet gotten good mechanisms for decisions). At least they're doing better than they used to, not making full 256 or 512-bit memory constants when the only uses are with broadcast ops. – Peter Cordes Jul 13 '20 at 03:24