A bitmap is a more efficient way to store it, if you can have the rest of your program use that. (e.g. via *Fastest way to unpack 32 bits to a 32 byte SIMD vector* or *is there an inverse instruction to the movemask instruction in intel avx2?* if you want to use it with other vectors.)
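To make the bitmap option concrete, here's a minimal sketch, assuming SSE and that `n` is a multiple of 8 (the function name, predicate, and loop structure are mine, for illustration only):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// a[i] < b[i]  ->  1 bit per element, 8 elements per output byte
void cmp_to_bitmap(uint8_t *bitmap, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128 c0 = _mm_cmplt_ps(_mm_loadu_ps(a + i),     _mm_loadu_ps(b + i));
        __m128 c1 = _mm_cmplt_ps(_mm_loadu_ps(a + i + 4), _mm_loadu_ps(b + i + 4));
        // movemask_ps grabs the sign bit of each dword element -> 4-bit integer
        bitmap[i / 8] = (uint8_t)(_mm_movemask_ps(c0) | (_mm_movemask_ps(c1) << 4));
    }
}
```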
Or if you can cache-block it and use at most a couple KiB of mask vectors, you could just store the compare results directly for reuse without packing them down (in an array of `alignas(16) int32_t masks[]`, in case you want to access it from scalar code). But only if you can do it with a small footprint in L1d. Or much better, use the compare result on the fly as a mask for another vector operation, so you're not storing/reloading mask data at all.
## `packssdw` / `packsswb` dword compare results down to bytes
You're correct: if you don't want your elements packed down to single bits, don't use `_mm_movemask_ps` or `epi8`. Instead, use vector pack instructions. `cmpps` produces elements of all-zero / all-one bits, i.e. integer 0 (false) or -1 (true). Signed integer pack instructions preserve 0 / -1 values, because both are in range for `int8_t`¹. To keep the compiler happy, you need `_mm_castps_si128` to reinterpret a `__m128` as a `__m128i`.
This works most efficiently when packing 4 vectors of 4 float compare results each down to one vector of 16 separate bytes. (Or with AVX2, 4 vectors of 8 floats down to 1 vector of 32 bytes, requiring an extra permute at the end because `_mm256_packs_epi32` and friends operate in-lane, like two separate 16-byte pack operations. Probably a `_mm256_permutevar8x32_epi32` (`vpermd`) with a vector constant as the control operand; see the AVX2 sketch after the 128-bit version below.)
```c
#include <immintrin.h>

// or bool *result if you keep the abs value (_mm_abs_epi8) for 0 / 1 output
void cmp(int8_t *result, const float *a)
{
    __m128 cmp0 = _mm_cmp_ps(...);   // produces integer 0 or -1 elements
    __m128 cmp1 = _mm_cmp_ps(...);
    __m128 cmp2 = _mm_cmp_ps(...);
    __m128 cmp3 = _mm_cmp_ps(...);

    // 2x 32-bit dword -> 16-bit word with signed saturation - packssdw
    __m128i lo_words = _mm_packs_epi32(_mm_castps_si128(cmp0), _mm_castps_si128(cmp1));
    __m128i hi_words = _mm_packs_epi32(_mm_castps_si128(cmp2), _mm_castps_si128(cmp3));
    __m128i cmp_bytes = _mm_packs_epi16(lo_words, hi_words);    // packsswb: 0 / -1

    // if necessary create 0 / 1 bools.  If not, just store cmp_bytes
    cmp_bytes = _mm_abs_epi8(cmp_bytes);                      // SSSE3
    //cmp_bytes = _mm_and_si128(cmp_bytes, _mm_set1_epi8(1)); // SSE2
    _mm_storeu_si128((__m128i*)result, cmp_bytes);
}
```
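To make the AVX2 widening concrete, here's a minimal sketch; the `vpermd` control constant `_mm256_setr_epi32(0,4,1,5,2,6,3,7)` is my derivation of the lane fixup, not something spelled out above:

```c
#include <immintrin.h>
#include <stdint.h>

// AVX2: 4 vectors of 8 float compare results -> 32 bytes
void cmp32(int8_t *result, __m256 cmp0, __m256 cmp1, __m256 cmp2, __m256 cmp3)
{
    __m256i lo_words = _mm256_packs_epi32(_mm256_castps_si256(cmp0),
                                          _mm256_castps_si256(cmp1)); // in-lane packssdw
    __m256i hi_words = _mm256_packs_epi32(_mm256_castps_si256(cmp2),
                                          _mm256_castps_si256(cmp3));
    __m256i cmp_bytes = _mm256_packs_epi16(lo_words, hi_words);       // in-lane packsswb

    // The in-lane packs leave the 4-byte chunks in [c0.lo, c1.lo, c2.lo, c3.lo,
    // c0.hi, c1.hi, c2.hi, c3.hi] dword order; one lane-crossing vpermd fixes that.
    __m256i fixup = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    cmp_bytes = _mm256_permutevar8x32_epi32(cmp_bytes, fixup);

    cmp_bytes = _mm256_abs_epi8(cmp_bytes);      // optional: 0 / 1 instead of 0 / -1
    _mm256_storeu_si256((__m256i*)result, cmp_bytes);
}
```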
Getting a 0/1 instead of 0/-1 takes a `_mm_and_si128` or SSSE3 `_mm_abs_epi8`, if you truly need `bool` instead of a zero/non-zero `uint8_t[]` or `int8_t[]`.
If you only have a single vector of float, you'd want SSSE3 `_mm_shuffle_epi8` (`pshufb`) to grab 1 byte from each dword, for `_mm_storeu_si32`. (Beware that it was broken in early GCC11 versions, and wasn't even supported before then; but now it is supported as a strict-aliasing-safe unaligned store. Otherwise use `_mm_cvtsi128_si32` to get an `int`, and `memcpy` that to an array of `bool`.)
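A minimal sketch of that single-vector case (the shuffle-control constant and the compare predicate are my assumptions; index bytes with the high bit set make `pshufb` write zeros):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// One vector of 4 float compare results -> 4 bytes (0 / -1)
void cmp4(int8_t *result, __m128 a, __m128 b)
{
    __m128i cmp = _mm_castps_si128(_mm_cmplt_ps(a, b));   // 0 / -1 dword elements
    // Grab byte 0 of each dword; -1 (high bit set) zeroes the upper 12 bytes
    __m128i ctrl = _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1,
                                 -1, -1, -1, -1, -1, -1, -1, -1);
    __m128i bytes = _mm_shuffle_epi8(cmp, ctrl);          // SSSE3 pshufb
    _mm_storeu_si32(result, bytes);                       // 4-byte store
    // Pre-GCC11 fallback:
    // int32_t tmp = _mm_cvtsi128_si32(bytes);
    // memcpy(result, &tmp, 4);
}
```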
## Footnote 1: signed pack instructions are the only good choice
All pack instructions before AVX-512F `vpmovdb` / `vpmovusdb` do saturation (not truncation), and treat their inputs as signed. This makes unsigned-pack instructions useless: we'd need to mask both inputs first, or they'd saturate `-1` to `0`, not `0xffff` to `0xff`.
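A tiny standalone demonstration of that saturation behavior (my own example, not from the answer):

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i minus1 = _mm_set1_epi16(-1);            // 0xFFFF in every word
    __m128i u = _mm_packus_epi16(minus1, minus1);   // packuswb: signed -1 saturates to 0x00
    __m128i s = _mm_packs_epi16(minus1, minus1);    // packsswb: -1 stays -1, i.e. 0xFF
    printf("packus: 0x%02x  packs: 0x%02x\n",
           (unsigned)(uint8_t)_mm_cvtsi128_si32(u),
           (unsigned)(uint8_t)_mm_cvtsi128_si32(s)); // prints 0x00 vs 0xff
    return 0;
}
```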
`punpcklwd` / `punpckhwd` can interleave 16-bit words from two registers, but only from the low or high half of those registers, so they're not a great building-block.
Truncation would also work, but there's only SSSE3 `pshufb`: no 2-register shuffles as useful as the `pack...` instructions until AVX-512. (`vpblendw` could interleave halves of dwords from two different input registers.)
Even with AVX-512, `vpmovdb` only has one input register, vs. `vpack...` instructions that produce a full-width output with elements from two full-width inputs. (In 16-byte lanes separately, so you'd still need a `vpermd` at the end to put your 4-byte chunks that came from lanes of 4 floats into the right order.)
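To illustrate that single-input limitation (a hedged sketch of mine, assuming AVX-512F; since AVX-512 compares produce masks rather than vectors, it first re-expands the mask into 0 / -1 dwords with a zero-masked broadcast):

```c
#include <immintrin.h>
#include <stdint.h>

// vpmovdb truncates 16 dwords to 16 bytes, but reads only one input register,
// so 64 results would take 4 of these plus combining, vs. 2-input vpack... chains.
void cmp16(int8_t *result, __m512 a, __m512 b)
{
    __mmask16 k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OQ);
    __m512i dwords = _mm512_maskz_set1_epi32(k, -1);   // mask -> 0 / -1 dword vector
    // vpmovdb is truncation: -1 (0xFFFFFFFF) becomes 0xFF, 0 stays 0
    _mm_storeu_si128((__m128i*)result, _mm512_cvtepi32_epi8(dwords));
}
```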
Of course, AVX-512 with 512-bit vector width can only compare-into-mask. This is fantastic for storing a bitmap: just `vcmpps k1, zmm0, [rsi]` / `kmov [rdi], k1`. But for storing a `bool` array, you'd probably want `kunpck` to concatenate compare results: 2x `kunpckwd` to combine 16-bit masks into 32-bit masks, then `kunpckdq` to make a single 64-bit mask from 64 float compare results. Then use that with a zero-masked `vmovdqu8 zmm0{k1}{z}, zmm1` and store that to memory. (A memory destination only allows merge-masking, not zero-masking.)
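A minimal AVX-512BW sketch of that `kunpck` chain, assuming 64 contiguous floats and a less-than predicate; the intrinsic argument order (first operand lands in the high half) is my reading of `kunpckwd`/`kunpckdq`:

```c
#include <immintrin.h>
#include <stdint.h>

// 64 float compare results -> 64-byte 0 / 1 bool array
void cmp64(uint8_t *result, const float *a, const float *b)
{
    __mmask16 m0 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a),      _mm512_loadu_ps(b),      _CMP_LT_OQ);
    __mmask16 m1 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 16), _mm512_loadu_ps(b + 16), _CMP_LT_OQ);
    __mmask16 m2 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 32), _mm512_loadu_ps(b + 32), _CMP_LT_OQ);
    __mmask16 m3 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 48), _mm512_loadu_ps(b + 48), _CMP_LT_OQ);

    __mmask32 lo = _mm512_kunpackw(m1, m0);   // kunpckwd: m0 -> bits 0..15, m1 -> bits 16..31
    __mmask32 hi = _mm512_kunpackw(m3, m2);
    __mmask64 k  = _mm512_kunpackd(hi, lo);   // kunpckdq: one 64-bit mask

    // Zero-masked vmovdqu8 into a register, then a full-width store
    __m512i bools = _mm512_maskz_mov_epi8(k, _mm512_set1_epi8(1));
    _mm512_storeu_si512(result, bools);
    // For the bitmap option instead: _store_mask64((__mmask64*)bitmap, k);  (kmov to memory)
}
```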
AVX-512 could still potentially be useful with only 256-bit registers (to avoid turbo penalties and so on), although `vpermt2w` / `vpermt2b` aren't single-uop even on Ice Lake.
## Compiler auto-vectorization of scalar source is sub-optimal
Compilers do auto-vectorize this (https://godbolt.org/z/3o58W919Y), but do a rather poor job of it. It's still very likely faster than scalar, especially with AVX2 available.
clang packs each vector down separately, for 4-byte stores, but the individual packing is decently efficient. With AVX2, it jumps through some extra hoops: `vextractf128` and then packing 8 dwords down to 8 bytes, before shuffling that together with another 8 bools and then `vinserti128` with another 16. So it's eventually storing 32 bytes of bools at a time, but takes a lot of shuffles to get there.
Per YMM store, clang `-march=haswell` does 4 `vextract`, 8 `pack`s, 2 `vinsert`, 1 `vpunpcklqdq`, and 1 `vpermq`, for a total of 16 shuffles: one shuffle per 2 bytes of output. My version does 3 shuffles per 16 bytes of output, or with AVX2, 4 per 32 bytes if you widen everything to `__m256i` / `_mm256...` and add a final shuffle to fix up for lane-crossing. (Plus 4x `vcmpps`, and 1x `vpabsb` to flip -1 to +1.)
GCC uses unsigned pack instructions like `packusdw` as the first step, doing `pand` instructions on each input to each pack instruction, and also an unnecessary one between the two steps, because I think it's emulating unsigned->unsigned packing in terms of the SSE4.1 `packusdw` / SSE2 `packuswb` signed->unsigned packs. Even if it's stuck on using unsigned packing, it would be a lot less bad to just mask (or `pabsd`) the value to 0 or 1 once, so no further masking is needed before or after the 2 packing steps.
(SSE2 `packssdw` preserves -1 or 0 just fine, without even needing to saturate. It seems GCC isn't keeping track of the limited value-range of compare results, so it doesn't realize it can let the pack instructions just work.)
And without SSE4.1, GCC does even worse: with only SSSE3 it uses some `pshufb` and `por` instructions to feed 2x `pand` / SSE2 `packuswb`.
With word->byte pack instructions all treating their inputs as signed, it makes some sense that Intel omitted a `packusdw` 32 -> 16-bit pack until SSE4.1: the normal first step for packing dwords to bytes is `packssdw`, even if you eventually want to clamp signed integers to a 0..255 range.