
I want to load, compare, and pack the results of 64 double comparisons into a uint64_t bitmask as efficiently as possible at runtime.

My current approach is to compare 2x4 doubles at a time via AVX2 using _mm256_cmp_pd. The results of both vector comparisons (8 in total) are converted into bitmaps using _mm256_movemask_pd, combined via a | b << 4, and assigned as a byte into a union (1x uint64_t / 8x uint8_t) to save a few shifts/ORs.

This example might help to visualize it:

#include <immintrin.h>
#include <stdint.h>

union ui64 {
    uint64_t i64;
    struct { uint8_t i0,i1,i2,i3,i4,i5,i6,i7; } i8;
};
static inline uint64_t cmp64 (double* in0, double* in1) {
    ui64 result;
    // +0
    result.i8.i0 =
        _mm256_movemask_pd(
            _mm256_cmp_pd(
            _mm256_load_pd(in0 + 0), 
            _mm256_load_pd(in1 + 0), 
            _CMP_LT_OQ)) |
        _mm256_movemask_pd(
            _mm256_cmp_pd(
                _mm256_load_pd(in0 + 4), 
                _mm256_load_pd(in1 + 4),
                _CMP_LT_OQ)) << 4;

    // +8
    // load, compare, pack n+8 each 'iteration' using result.i8.i1,...i7
    // ...

    return result.i64;
}
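
For reference, here is a compact loop form of the same pattern (my own rewrite of the elided part, not code from the question; the name cmp64_loop is made up, and it keeps the 32-byte alignment requirement of _mm256_load_pd):

// Hypothetical loop version of the approach above: 8 bits per iteration,
// shifted into place instead of written through a union.
static inline uint64_t cmp64_loop(const double* in0, const double* in1) {
    uint64_t result = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t lo = (uint64_t)_mm256_movemask_pd(_mm256_cmp_pd(
            _mm256_load_pd(in0 + 8 * i),     _mm256_load_pd(in1 + 8 * i),     _CMP_LT_OQ));
        uint64_t hi = (uint64_t)_mm256_movemask_pd(_mm256_cmp_pd(
            _mm256_load_pd(in0 + 8 * i + 4), _mm256_load_pd(in1 + 8 * i + 4), _CMP_LT_OQ));
        result |= (lo | hi << 4) << (8 * i);
    }
    return result;
}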

Variants of compress & set appear straightforward, but use slower instructions: 1x _mm256_set_m128 and 2x _mm256_cvtpd_ps versus a scalar << and |, like this:

_mm256_movemask_ps(_mm256_set_m128(
    _mm256_cvtpd_ps(v0),
    _mm256_cvtpd_ps(v1)));

The CPU used is a Zen 1 (AVX2 at most); I'm not sure whether GPU usage (Nvidia) is an option.

Please share your thoughts.

Chris G.
  • Packing three-way compare `cmpgt(b, a) - cmpgt(a, b)` results into bytes would just be a variation of these answers. In that case, the cross-lane shuffles could be removed by interleaving `gt` and `lt` (using `unpack`) just before `vpmovmskb`. – aqrit Feb 04 '22 at 03:33
  • @Chris: Check the update to Soonts' answer: we saved a few more uops, especially for Zen1. First with a scalar rotate of the bitmap, then with an even better idea that clang's shuffle optimizer found, which saved a `vpshufb` in `combine32`, replacing `vperm2i128` with `vpermq`. (Big win on Zen1, maybe about neutral on Zen2/3, and a nice win on Intel.) – Peter Cordes Feb 04 '22 at 15:25
  • For future readers, AVX-512 can do this with compare-into-mask, 8 doubles at a time, and concatenate pairs of masks with 4x [`kunpckbw`](https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq) / 2x `KUNPCKWD` / 1x `KUNPCKDQ`. – Peter Cordes Feb 05 '22 at 10:05 (a sketch of this idea follows below)
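
A minimal sketch of that AVX-512 idea (mine, not from the question or answers; untested, assumes AVX-512F plus AVX-512BW for the wider kunpck forms, and the function name is made up):

// 8x compare-into-mask, then concatenate the 8-bit masks pairwise.
// Note: the kunpck intrinsics put their FIRST operand into the HIGH half of the result.
static inline uint64_t cmp64_avx512(const double* a, const double* b) {
    __mmask8 m[8];
    for (int i = 0; i < 8; ++i)
        m[i] = _mm512_cmp_pd_mask(_mm512_loadu_pd(a + 8 * i),
                                  _mm512_loadu_pd(b + 8 * i), _CMP_LT_OQ);

    __mmask16 w0 = _mm512_kunpackb(m[1], m[0]);  // bits 0..15
    __mmask16 w1 = _mm512_kunpackb(m[3], m[2]);  // bits 16..31
    __mmask16 w2 = _mm512_kunpackb(m[5], m[4]);  // bits 32..47
    __mmask16 w3 = _mm512_kunpackb(m[7], m[6]);  // bits 48..63
    __mmask32 d0 = _mm512_kunpackw(w1, w0);      // bits 0..31
    __mmask32 d1 = _mm512_kunpackw(w3, w2);      // bits 32..63
    return (uint64_t)_mm512_kunpackd(d1, d0);
}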

2 Answers


Here’s an example. It packs the comparison results into bytes with whatever instructions were the most efficient, shuffles them into the correct order once per 32 numbers, and uses _mm256_movemask_epi8 to produce 32 bits at once.

#include <immintrin.h>
#include <stdint.h>

// Compare 4 numbers, return 32 bytes with results of 4 comparisons:
// 00000000 11111111 22222222 33333333
inline __m256d compare4( const double* a, const double* b )
{
    return _mm256_cmp_pd( _mm256_load_pd( a ), _mm256_load_pd( b ), _CMP_LT_OQ );
}

// Compare 8 numbers, combine 8 results into the following bytes:
// 0000 4444 1111 5555  2222 6666 3333 7777
inline __m256i compare8( const double* a, const double* b )
{
    __m256 c0 = _mm256_castpd_ps( compare4( a, b ) );
    __m256 c1 = _mm256_castpd_ps( compare4( a + 4, b + 4 ) );
    return _mm256_castps_si256( _mm256_blend_ps( c0, c1, 0b10101010 ) );
}

// Compare 16 numbers, combine 16 results into following bytes:
// 00 44 11 55  88 CC 99 DD  22 66 33 77  AA EE BB FF
inline __m256i compare16( const double* a, const double* b )
{
    __m256i c0 = compare8( a, b );
    __m256i c1 = compare8( a + 8, b + 8 );
    return _mm256_packs_epi32( c0, c1 );
}

inline uint32_t compare32( const double* a, const double* b )
{
    // Compare 32 numbers and merge them into a single vector
    __m256i c0 = compare16( a, b );
    __m256i c1 = compare16( a + 16, b + 16 );
    __m256i src = _mm256_packs_epi16( c0, c1 );

    // We got the 32 bytes, but the byte order is screwed up in that vector:
    // 0   4   1   5   8   12  9   13  16  20  17  21  24  28  25  29
    // 2   6   3   7   10  14  11  15  18  22  19  23  26  30  27  31
    // The following 2 instructions are fixing the order

    // Shuffle 8-byte pieces across the complete vector
    // That instruction is relatively expensive on most CPUs, but we're only doing it once per 32 numbers
    src = _mm256_permute4x64_epi64( src, _MM_SHUFFLE( 3, 1, 2, 0 ) );

    // The order of bytes in the vector is still wrong:
    // 0    4   1   5   8  12   9  13    2   6   3   7  10  14  11  15
    // 16  20  17  21  24  28  25  29   18  22  19  23  26  30  27  31
    // Need one more shuffle instruction

    const __m128i perm16 = _mm_setr_epi8(
        0, 2, 8, 10,  1, 3, 9, 11,   4, 6, 12, 14,  5, 7, 13, 15 );
    // If you're calling this in a loop and everything is inlined,
    // the shuffling vector should be pre-loaded outside of the loop.
    const __m256i perm = _mm256_broadcastsi128_si256( perm16 );

    // Shuffle the bytes
    src = _mm256_shuffle_epi8( src, perm );

    // The order of bytes is now correct, can use _mm256_movemask_epi8 to make 32 bits of the result
    return (uint32_t)_mm256_movemask_epi8( src );
}


uint64_t compareAndPack64( const double* a, const double* b )
{
    uint64_t low = compare32( a, b );
    uint64_t high = compare32( a + 32, b + 32 );
    return low | ( high << 32 );
}
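
A small usage sketch (mine, not part of the answer), appended to the code above: since compare4 uses _mm256_load_pd, the inputs must be 32-byte aligned.

#include <stdio.h>

int main() {
    alignas(32) double a[64], b[64];
    for (int i = 0; i < 64; ++i) { a[i] = i; b[i] = 32.0; }  // a[i] < b[i] only for i < 32
    uint64_t mask = compareAndPack64(a, b);
    printf("%016llx\n", (unsigned long long)mask);           // expected: 00000000ffffffff
}
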
Soonts
  • Oh, good idea to start with dword blends, yeah that's better than my suggestion, especially on Intel CPUs, but I often forget to look for that when shuffling is needed, too. But since we need a byte shuffle anyway, it's pure win, I think. Or is that why you need `_mm256_permute2x128_si256`? It's not expensive on Intel CPUs or Zen2, 1 uop / 3c latency, better than some other lane-crossing shuffles on Zen2. But yeah it's quite bad on Zen1. Can you replace that with scalar `rol` on the movemask result? That'd be a win everywhere; no I guess you can't, if you need to OR halves together. – Peter Cordes Feb 03 '22 at 20:41
  • Zen1 would benefit from 2x movemask (1 uop each even for YMM) and scalar rotate / OR to flip one of the movemask results and then combine there. – Peter Cordes Feb 03 '22 at 20:42
  • @PeterCordes Good point, but to reliably rotate these masks, one needs either the VC++ compiler for the `_rotr` intrinsic, or the C++20 language edition for `std::rotr` in the standard library. The OP specified none of these tags. – Soonts Feb 04 '22 at 10:37
  • [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/q/776508) shows that rotate can be done portably in a way that reliably compiles to a rotate instruction. Especially with a compile-time constant count of `16`, more compilers and older versions of them are able to recognize that idiom, even MSVC, which has problems optimizing the variable-count idiom. – Peter Cordes Feb 04 '22 at 11:00
  • It's not going to be a "couple cycles" slower on CPUs other than Zen1. It's only 1 extra uop for the front-end (assuming BMI2 `rorx`), and it's trading a lane-crossing shuffle (`vperm2i128`) for an extra `movemask` + `rorx`. That's a latency win on most CPUs, and less competition for shuffle port(s) for throughput. (And we're trading a SIMD OR for a scalar OR.) So I think at worst it'll cost a fraction of a cycle of extra front-end throughput, if that was the bottleneck, otherwise probably helps. – Peter Cordes Feb 04 '22 at 13:30
  • Wait a minute, clang's shuffle optimizer found something even better in your original compare, just `vpermq` + one `vpshufb` after the pack. https://godbolt.org/z/KWr1o7r1a. That's a lot cheaper on Zen1 than `vperm2i128`. Probably worth trying clang for the full thing, not just the compare32 part... https://godbolt.org/z/n5Mxo7YeT - I think it's still doing your shuffles except for that last step. Nicely interleaving shuffles with `vcmppd`, but yeah, looking at it all inlined there's a lot of total shuffles. – Peter Cordes Feb 04 '22 at 13:36
  • BTW, MSVC needs `` to define `_rotr`; you forgot that. (Also, `rorx` isn't needed, just regular `ror` is equally good; I forgot that the original value of that mask isn't needed later, it's a separate shuffle.) But anyway that's probably no longer useful vs. porting clang's optimized asm back to intrinsics. – Peter Cordes Feb 04 '22 at 13:47
  • @PeterCordes Yeah, it saved 2 instructions + some data, at the cost of a slightly more expensive shuffle. Probably a win overall; updated. – Soonts Feb 04 '22 at 14:26
  • More expensive shuffle? Oh, you mean on Zen2, where `vpermq` is 2 uops vs. 1 for `vperm2i128`? On every other CPU, it's as cheap or cheaper (especially Zen1). (Err, on Alder Lake E-cores it's 1 cycle higher latency, but both are 2 uops.) Huh, I'd forgotten vpermq was more expensive on some CPUs, despite being 1-input. – Peter Cordes Feb 04 '22 at 14:33
  • @PeterCordes `vperm2i128` is literally twice as fast on my computer (Zen 3) compared to `vpermq`, 3 versus 6 cycles of latency. – Soonts Feb 04 '22 at 14:43
  • Twice the latency, but throughput is very similar (1.27 vs. 1.0 on both Zen2 and 3; apparently the throughput is worse than one would expect from the uops / ports for vpermq on Zen2/3). Most SIMD code isn't bottlenecked on latency; your use-case does `compare32` twice so there's some ILP even that close to the top of the fan-in tree. Anyway, overall saving a `vpshufb` is probably at least break-even in terms of shuffle throughput on Zen2/3, and saving the `vpor` is great. – Peter Cordes Feb 04 '22 at 14:46
  • It's impressive to see how much potential can be harnessed in seemingly simple instructions through optimization and different ways of thinking. – Chris G. Feb 05 '22 at 07:41

On a CPU with full-width AVX2 (like Zen2 or Haswell / Skylake), you'd probably do well with vpackssdw / vpacksswb to horizontally pack down from qwords to bytes, narrowing in half every time. So a total of 8 input vectors would become one vector that you do vpmovmskb on (_mm256_movemask_epi8). VCMPPD results are all-ones (-1), which stays -1, or all-zeros, which stays 0, in both halves of a qword even if you use a narrower pack element size. But that packing is in-lane (within 128-bit halves of a vector), so after eventually packing down to bytes you need a vpshufb + vpermd to get the bytes in order before vpmovmskb. (AMD doesn't have fast pdep until Zen3, otherwise you could use that to interleave pairs of bits if you didn't do a lane-crossing fixup shuffle.)
See How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i) for a 4:1 pack; an 8:1 makes the final shuffle more complicated unless we do more shuffles earlier, while dword chunks are small enough.

(I'm using asm mnemonic names because they're shorter to type and less cluttered to read than intrinsics, and they're what you need to look anything up in instruction tables to find out how many uops things cost; see https://uops.info/ or https://agner.org/optimize/.)

But with every 256-bit SIMD operation costing 2 uops, you might do well on Zen 1 with just vmovmskpd and scalar bit-shift / OR. If the surrounding code is all vector, having these uops use the scalar integer ALUs is good. The front-end is 6 uops wide, or 5 instructions, whichever is less, but there are only 4 integer and 4 SIMD ALU pipes, so ideally earlier and later code can overlap execution nicely. (And some specific ALU units have even more limited throughput, e.g. these shuffles run on only 2 of the 4 ports.)

Or maybe one step of vector packing and then _mm256_movemask_ps? Lane-crossing shuffles are relatively expensive on Zen 1. But not too bad: vpermq (or vpermpd) is only 3 uops with 2-cycle throughput, vs. 2 uops with 1c throughput for vpackssdw. (And 3 uops with 4c throughput for vpermd.)

Assuming vpackssdw ymm uses the same ports as the XMM version, that's FP1 / FP2. So it can partially overlap with vcmppd, which can run on FP01. (The YMM version of that is also 2 uops, with 1c throughput if not mixed with other instructions.)

https://uops.info/ doesn't get that level of detail for multi-uop instructions on some AMD CPUs the way it does for Intel, but we can assume the YMM versions of non-lane-crossing instructions are just two of the same uop as the XMM version, where it does have that data.


You very likely don't want to use _mm256_cvtpd_ps, which costs shuffle uops plus an FP->FP conversion. It costs 2 uops but only has one input vector, not two. Interpreting the compare result as a -NaN double, you might well get a float -NaN, so it might actually work for correctness. It's definitely slower that way on most CPUs.
On Zen1 it has 2-cycle throughput, and that's per single input vector rather than per pair of vectors.


With 4x vpackssdw we can reduce 8 vectors to 4.
Then 2x vpackssdw ymm reduces to 2 vectors.
Then 1x vpacksswb ymm reduces to 1 vector, with pairs of bytes in the wrong order.

For Zen 1, maybe start with 4 input vectors, and after reducing to one YMM, split it in half with vextracti128 which is only a single uop on Zen 1, for any port (since the two halves of a YMM register are already stored separately in physical registers). Then vpacksswb the two halves together (1 uop), setting up for vpshufb xmm (1 uop) to put pairs of bytes in the right order. That sets up for vpmovmskb. So the only lane-crossing shuffle is just an extract.
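
Here's how I'd spell that out in intrinsics (my interpretation of the description above, untested; the function names are made up, and it uses unaligned loads to stay general). It implements the pack tree from the list above, scaled down to 4 input vectors: 16 compares produce one in-order XMM of byte masks, ready for vpmovmskb.

// 16 compares -> one XMM register with one byte per result, in order 0..15
static inline __m128i cmp16_bytes(const double* a, const double* b) {
    __m256i v0 = _mm256_castpd_si256(_mm256_cmp_pd(_mm256_loadu_pd(a +  0), _mm256_loadu_pd(b +  0), _CMP_LT_OQ));
    __m256i v1 = _mm256_castpd_si256(_mm256_cmp_pd(_mm256_loadu_pd(a +  4), _mm256_loadu_pd(b +  4), _CMP_LT_OQ));
    __m256i v2 = _mm256_castpd_si256(_mm256_cmp_pd(_mm256_loadu_pd(a +  8), _mm256_loadu_pd(b +  8), _CMP_LT_OQ));
    __m256i v3 = _mm256_castpd_si256(_mm256_cmp_pd(_mm256_loadu_pd(a + 12), _mm256_loadu_pd(b + 12), _CMP_LT_OQ));

    // qword masks -> word masks via in-lane packs (order gets scrambled)
    __m256i p0 = _mm256_packs_epi32(v0, v1);
    __m256i p1 = _mm256_packs_epi32(v2, v3);
    __m256i q  = _mm256_packs_epi32(p0, p1);          // one word per compare

    // split the YMM (vextracti128: 1 uop on Zen 1) and pack the halves to bytes
    __m128i lo = _mm256_castsi256_si128(q);
    __m128i hi = _mm256_extracti128_si256(q, 1);
    __m128i r  = _mm_packs_epi16(lo, hi);             // bytes: 0 1 4 5 8 9 12 13 2 3 6 7 10 11 14 15

    // fix up the in-lane scrambling (pairs of bytes move together)
    const __m128i fix = _mm_setr_epi8(0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15);
    return _mm_shuffle_epi8(r, fix);
}

static inline uint32_t cmp16_mask(const double* a, const double* b) {
    return (uint32_t)_mm_movemask_epi8(cmp16_bytes(a, b));
}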

Or instead of getting 16-bit chunks of bitmap, you could maybe do the above twice, then vinserti128 ymm, xmm, 1 (2 uops, 0.67c throughput) / vpmovmskb ymm (1 uop) to get a 32-bit chunk of bitmap. Those 3 uops replace 2x vpmovmskb xmm / shl / or, so you're saving a uop, and they have good flexibility of which vector ALU port they can run on. Although it is more vector ALU pressure.
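
A sketch of that 32-bit-chunk variant, reusing the hypothetical cmp16_bytes helper from the sketch above (_mm256_set_m128i compiles to vinserti128); two of these plus a shift/OR would give the full 64-bit result:

// 32 compares -> 32-bit chunk of the bitmap: two 16-byte halves, one YMM movemask
static inline uint32_t cmp32_mask(const double* a, const double* b) {
    __m256i v = _mm256_set_m128i(cmp16_bytes(a + 16, b + 16),   // high lane -> bits 16..31
                                 cmp16_bytes(a,      b));       // low lane  -> bits 0..15
    return (uint32_t)_mm256_movemask_epi8(v);
}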

Peter Cordes
  • Together with `vpshufb` it seems logical to use `vpackssdw` indeed. The order (a0-3, b0-3, a4-7, b4-7, a8-11, b8-11, a12-15, b12-15) appeared weird at first glance. – Chris G. Feb 03 '22 at 18:14
  • @ChrisG.: Yeah, AVX / AVX2 extending shuffle operations to two separate 128-bit in-lane operations pretty much sucks, except that CPUs like Zen1 with half-width execution units can implement it efficiently (as just 2 uops), and in terms of complexity on full-width CPUs, I guess. That's the only upside; it's pure downside for most uses, needing an extra instruction. AVX-512 is the same, but provides many powerful new shuffles with narrower granularity that you can use instead. – Peter Cordes Feb 03 '22 at 18:28