On a CPU with full-width AVX2 execution units (like Zen 2, or Haswell / Skylake), you'd probably do well with vpackssdw / vpacksswb to pack horizontally from qwords down to bytes, narrowing in half each time. So a total of 8 input vectors becomes one vector that you do vpmovmskb (_mm256_movemask_epi8) on. VCMPPD results are all-ones (-1), which stays -1, or all-zeros, which stays 0, in both halves of a qword even if you use a narrower pack element size. But that packing is in-lane (within 128-bit halves of a vector), so after eventually packing down to bytes you need a vpshufb + vpermd to get the bytes in order before vpmovmskb. (AMD doesn't have fast pdep until Zen 3; otherwise you could use that to interleave pairs of bits if you didn't do a lane-crossing fixup shuffle.)
See How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i) for a 4:1 pack; an 8:1 pack makes the final shuffle more complicated unless we do more shuffles earlier, while the chunks are still dword-sized.
(I'm using asm mnemonic names because they're shorter to type and less cluttered to read than intrinsics, and they're what you need for looking anything up in instruction tables to find out how many uops things cost: https://uops.info/ or https://agner.org/optimize/.)
But with every 256-bit SIMD operation costing 2 uops, you might do well on Zen 1 with just vmovmskpd and scalar bit-shift / OR. If the surrounding code is all vector, having these uops use the scalar integer ALUs is good. The front-end is 6 uops wide (or 5 instructions, whichever is less), but there are only 4 integer and 4 SIMD ALU pipes, so ideally earlier and later code can overlap execution nicely. (And some specific ALU units have even more limited throughput, e.g. these shuffles run on only 2 of the 4 ports.)
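As a rough sketch of that scalar-combine route (assuming the 8 vcmppd results are sitting in an array; the function and array names are just placeholders):

```c
#include <immintrin.h>
#include <stdint.h>

// Hedged sketch: combine 8 vcmppd results into a 32-bit mask using only
// vmovmskpd plus scalar shift / OR work on the integer ALUs.
static inline uint32_t movmskpd_combine(const __m256d cmp[8])
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++) {
        // _mm256_movemask_pd (vmovmskpd) grabs the 4 sign bits of one vector.
        mask |= (uint32_t)_mm256_movemask_pd(cmp[i]) << (4 * i);
    }
    return mask;
}
```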
Or maybe one step of vector packing and then _mm256_movemask_ps? Lane-crossing shuffles are relatively expensive on Zen 1, but not too bad: vpermq (or vpermpd) is only 3 uops with 2-cycle throughput, vs. 2 uops with 1c throughput for vpackssdw. (And 3 uops with 4c throughput for vpermd.)
Assuming vpackssdw ymm uses the same ports as the XMM version, that's FP1 / FP2. So it can partially overlap with vcmppd, which can run on FP01. (The YMM version of that is also 2 uops, with 1c throughput if not mixed with other instructions.)
https://uops.info/ doesn't get that level of detail for multi-uop instructions on some AMD CPUs the way it does for Intel, but we can assume the YMM versions of non-lane-crossing instructions are just two of the same uop as the XMM version, where it does have that data.
You very likely don't want to use _mm256_cvtpd_ps, which costs shuffle uops plus an FP->FP conversion. It's 2 uops but only handles one input vector, not two. Interpreting the compare result as a -NaN double, you might well get a float -NaN out, so it might actually work for correctness, but it's definitely slower that way on most CPUs. On Zen 1 it has 2-cycle throughput, and that's per single input vector rather than per pair of vectors.
With 4x vpackssdw we can reduce 8 vectors to 4.
Then 2x vpackssdw ymm reduces those to 2 vectors.
Then 1x vpacksswb ymm reduces those to 1 vector, with pairs of bytes in the wrong order.
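Putting those three steps plus the lane-order fixup together, a sketch in intrinsics might look like this; the function name and input array are placeholders, and I've used vpermq + vpshufb for the fixup, which is one of the options discussed above:

```c
#include <immintrin.h>
#include <stdint.h>

// Hedged sketch of the 8:1 pack: 8 vcmppd results -> one 32-bit bitmap.
static inline uint32_t pack8_cmppd_to_mask(const __m256d c[8])
{
    // Step 1: 8 -> 4 vectors.  Each all-ones/all-zero qword is two identical
    // dwords, so vpackssdw narrows each result to two identical words.
    __m256i p0 = _mm256_packs_epi32(_mm256_castpd_si256(c[0]), _mm256_castpd_si256(c[1]));
    __m256i p1 = _mm256_packs_epi32(_mm256_castpd_si256(c[2]), _mm256_castpd_si256(c[3]));
    __m256i p2 = _mm256_packs_epi32(_mm256_castpd_si256(c[4]), _mm256_castpd_si256(c[5]));
    __m256i p3 = _mm256_packs_epi32(_mm256_castpd_si256(c[6]), _mm256_castpd_si256(c[7]));

    // Step 2: 4 -> 2 vectors.  Two identical words form an all-ones/all-zero
    // dword, so vpackssdw again narrows each result to one word.
    __m256i q0 = _mm256_packs_epi32(p0, p1);
    __m256i q1 = _mm256_packs_epi32(p2, p3);

    // Step 3: 2 -> 1 vector of bytes, one byte per compare result.
    __m256i r = _mm256_packs_epi16(q0, q1);

    // All three packs are in-lane, so fix up the ordering: vpermq (qword
    // granularity) moves the right data into each 128-bit lane, then vpshufb
    // puts the byte pairs in order within each lane.
    r = _mm256_permute4x64_epi64(r, 0xD8);
    const __m256i fixup = _mm256_setr_epi8(
        0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15,
        0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15);
    r = _mm256_shuffle_epi8(r, fixup);

    // vpmovmskb: bit i of the result is compare result i.
    return (uint32_t)_mm256_movemask_epi8(r);
}
```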
For Zen 1, maybe start with 4 input vectors, and after reducing to one YMM, split it in half with vextracti128, which is only a single uop on Zen 1, for any port (since the two halves of a YMM register are already stored separately in physical registers). Then vpacksswb the two halves together (1 uop), setting up for vpshufb xmm (1 uop) to put pairs of bytes in the right order. That sets up for vpmovmskb. So the only lane-crossing shuffle is just an extract.
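A sketch of that 4-vector variant (placeholder names again), producing a 16-bit chunk of bitmap:

```c
#include <immintrin.h>
#include <stdint.h>

// Hedged sketch: 4 vcmppd results -> one 16-bit bitmap, keeping the only
// lane-crossing shuffle down to a cheap vextracti128.
static inline uint32_t pack4_cmppd_to_mask16(__m256d c0, __m256d c1,
                                             __m256d c2, __m256d c3)
{
    // 4 -> 2 -> 1 YMM of words, one word per compare result (in-lane order).
    __m256i q0 = _mm256_packs_epi32(_mm256_castpd_si256(c0), _mm256_castpd_si256(c1));
    __m256i q1 = _mm256_packs_epi32(_mm256_castpd_si256(c2), _mm256_castpd_si256(c3));
    __m256i q  = _mm256_packs_epi32(q0, q1);

    // Split the YMM: vextracti128 is a single uop on Zen 1.
    __m128i lo = _mm256_castsi256_si128(q);
    __m128i hi = _mm256_extracti128_si256(q, 1);

    // vpacksswb xmm: 16 bytes, one per result, with byte pairs out of order.
    __m128i bytes = _mm_packs_epi16(lo, hi);

    // vpshufb xmm puts the pairs of bytes back in order for vpmovmskb.
    const __m128i fixup = _mm_setr_epi8(0, 1, 8, 9, 2, 3, 10, 11,
                                        4, 5, 12, 13, 6, 7, 14, 15);
    bytes = _mm_shuffle_epi8(bytes, fixup);

    return (uint32_t)_mm_movemask_epi8(bytes);  // 16-bit chunk of bitmap
}
```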
Or instead of getting 16-bit chunks of bitmap, you could maybe do the above twice, then vinserti128 ymm, xmm, 1 (2 uops, 0.67c throughput) / vpmovmskb ymm (1 uop) to get a 32-bit chunk of bitmap. Those 3 uops replace 2x vpmovmskb xmm / shl / or, so you're saving a uop, and you have good flexibility in which vector ALU ports they can run on, although it is more vector ALU pressure.
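Sketched out, with bytes_lo / bytes_hi standing in for the two 16-byte vectors from just before the vpmovmskb step above (hypothetical names):

```c
#include <immintrin.h>
#include <stdint.h>

// Hedged sketch: merge two already-ordered 16-byte halves and do one
// vpmovmskb ymm instead of 2x vpmovmskb xmm + shl + or.
static inline uint32_t combine_halves_movmskb(__m128i bytes_lo, __m128i bytes_hi)
{
    __m256i both = _mm256_inserti128_si256(_mm256_castsi128_si256(bytes_lo),
                                           bytes_hi, 1);    // vinserti128
    return (uint32_t)_mm256_movemask_epi8(both);             // 32-bit chunk of bitmap
}
```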