A bitmap is a more efficient way to store it, if you can have the rest of your program use that. (e.g. via *Fastest way to unpack 32 bits to a 32 byte SIMD vector* or *is there an inverse instruction to the movemask instruction in intel avx2?* if you want to use it with other vectors.)
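To make the bitmap option concrete, here's a minimal sketch, assuming SSE and that `n` is a multiple of 8 (the function name, predicate, and loop structure are mine, for illustration only):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// a[i] < b[i]  ->  1 bit per element, 8 elements per output byte
void cmp_to_bitmap(uint8_t *bitmap, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128 c0 = _mm_cmplt_ps(_mm_loadu_ps(a + i),     _mm_loadu_ps(b + i));
        __m128 c1 = _mm_cmplt_ps(_mm_loadu_ps(a + i + 4), _mm_loadu_ps(b + i + 4));
        // movemask_ps grabs the sign bit of each dword element -> 4-bit integer
        bitmap[i / 8] = (uint8_t)(_mm_movemask_ps(c0) | (_mm_movemask_ps(c1) << 4));
    }
}
```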
Or if you can cache-block it and use at most a couple KiB of mask vectors, you could just store the compare results directly for reuse without packing them down (in an array of `alignas(16) int32_t masks[]`, in case you want to access it from scalar code). But only if you can do it with a small footprint in L1d. Or much better, use the compare result on the fly as a mask for another vector operation, so you're not storing/reloading mask data at all.
## `packssdw` / `packsswb` dword compare results down to bytes
You're correct: if you don't want your elements packed down to single bits, don't use `_mm_movemask_ps` or `epi8`. Instead, use vector pack instructions. `cmpps` produces elements of all-zero / all-one bits, i.e. integer 0 (false) or -1 (true). Signed integer pack instructions preserve 0 / -1 values, because both are in range for `int8_t`¹. To keep the compiler happy, you need `_mm_castps_si128` to reinterpret a `__m128` as a `__m128i`.
This works most efficiently when packing 4 vectors of 4 float compare results each down to one vector of 16 separate bytes. (Or with AVX2, 4 vectors of 8 floats down to 1 vector of 32 bytes, requiring an extra permute at the end because `_mm256_packs_epi32` and friends operate in-lane, like two separate 16-byte pack operations. Probably a `_mm256_permutevar8x32_epi32` (`vpermd`) with a vector constant as the control operand; see the AVX2 sketch after the 128-bit version below.)
```c
#include <immintrin.h>

// or bool *result if you keep the abs value (_mm_abs_epi8) for 0 / 1 output
void cmp(int8_t *result, const float *a)
{
    __m128 cmp0 = _mm_cmp_ps(...);   // produces integer 0 or -1 elements
    __m128 cmp1 = _mm_cmp_ps(...);
    __m128 cmp2 = _mm_cmp_ps(...);
    __m128 cmp3 = _mm_cmp_ps(...);

    // 2x 32-bit dword -> 16-bit word with signed saturation - packssdw
    __m128i lo_words = _mm_packs_epi32(_mm_castps_si128(cmp0), _mm_castps_si128(cmp1));
    __m128i hi_words = _mm_packs_epi32(_mm_castps_si128(cmp2), _mm_castps_si128(cmp3));
    __m128i cmp_bytes = _mm_packs_epi16(lo_words, hi_words);    // packsswb: 0 / -1

    // if necessary create 0 / 1 bools.  If not, just store cmp_bytes
    cmp_bytes = _mm_abs_epi8(cmp_bytes);                      // SSSE3
    //cmp_bytes = _mm_and_si128(cmp_bytes, _mm_set1_epi8(1)); // SSE2
    _mm_storeu_si128((__m128i*)result, cmp_bytes);
}
```
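To make the AVX2 widening concrete, here's a minimal sketch; the `vpermd` control constant `_mm256_setr_epi32(0,4,1,5,2,6,3,7)` is my derivation of the lane fixup, not something spelled out above:

```c
#include <immintrin.h>
#include <stdint.h>

// AVX2: 4 vectors of 8 float compare results -> 32 bytes
void cmp32(int8_t *result, __m256 cmp0, __m256 cmp1, __m256 cmp2, __m256 cmp3)
{
    __m256i lo_words = _mm256_packs_epi32(_mm256_castps_si256(cmp0),
                                          _mm256_castps_si256(cmp1)); // in-lane packssdw
    __m256i hi_words = _mm256_packs_epi32(_mm256_castps_si256(cmp2),
                                          _mm256_castps_si256(cmp3));
    __m256i cmp_bytes = _mm256_packs_epi16(lo_words, hi_words);       // in-lane packsswb

    // The in-lane packs leave the 4-byte chunks in [c0.lo, c1.lo, c2.lo, c3.lo,
    // c0.hi, c1.hi, c2.hi, c3.hi] dword order; one lane-crossing vpermd fixes that.
    __m256i fixup = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    cmp_bytes = _mm256_permutevar8x32_epi32(cmp_bytes, fixup);

    cmp_bytes = _mm256_abs_epi8(cmp_bytes);      // optional: 0 / 1 instead of 0 / -1
    _mm256_storeu_si256((__m256i*)result, cmp_bytes);
}
```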
Getting a 0/1 instead of 0/-1 takes a `_mm_and_si128` or SSSE3 `_mm_abs_epi8`, if you truly need `bool` instead of a zero/non-zero `uint8_t[]` or `int8_t[]`.
If you only have a single vector of float, you'd want SSSE3 `_mm_shuffle_epi8` (`pshufb`) to grab 1 byte from each dword, for `_mm_storeu_si32`. (Beware that it was broken in early GCC11 versions, and wasn't even supported before then; but now it is supported as a strict-aliasing-safe unaligned store. Otherwise use `_mm_cvtsi128_si32` to get an `int`, and `memcpy` that to an array of `bool`.)
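A minimal sketch of that single-vector case (the shuffle-control constant and the compare predicate are my assumptions; index bytes with the high bit set make `pshufb` write zeros):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// One vector of 4 float compare results -> 4 bytes (0 / -1)
void cmp4(int8_t *result, __m128 a, __m128 b)
{
    __m128i cmp = _mm_castps_si128(_mm_cmplt_ps(a, b));   // 0 / -1 dword elements
    // Grab byte 0 of each dword; -1 (high bit set) zeroes the upper 12 bytes
    __m128i ctrl = _mm_setr_epi8(0, 4, 8, 12, -1, -1, -1, -1,
                                 -1, -1, -1, -1, -1, -1, -1, -1);
    __m128i bytes = _mm_shuffle_epi8(cmp, ctrl);          // SSSE3 pshufb
    _mm_storeu_si32(result, bytes);                       // 4-byte store
    // Pre-GCC11 fallback:
    // int32_t tmp = _mm_cvtsi128_si32(bytes);
    // memcpy(result, &tmp, 4);
}
```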
## Footnote 1: signed pack instructions are the only good choice
All pack instructions before AVX-512F `vpmovdb` / `vpmovusdb` do saturation (not truncation), and treat their inputs as signed. This makes unsigned-pack instructions useless: we'd need to mask both inputs first, or they'd saturate `-1` to `0`, not `0xffff` to `0xff`.
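A tiny standalone demonstration of that saturation behavior (my own example, not from the answer):

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i minus1 = _mm_set1_epi16(-1);            // 0xFFFF in every word
    __m128i u = _mm_packus_epi16(minus1, minus1);   // packuswb: signed -1 saturates to 0x00
    __m128i s = _mm_packs_epi16(minus1, minus1);    // packsswb: -1 stays -1, i.e. 0xFF
    printf("packus: 0x%02x  packs: 0x%02x\n",
           (unsigned)(uint8_t)_mm_cvtsi128_si32(u),
           (unsigned)(uint8_t)_mm_cvtsi128_si32(s)); // prints 0x00 vs 0xff
    return 0;
}
```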
`punpcklwd` / `punpckhwd` can interleave 16-bit words from two registers, but only from the low or high half of those registers, so they're not a great building-block.
Truncation would also work, but there's only SSSE3 `pshufb`: no 2-register shuffles as useful as the `pack...` instructions until AVX-512. (`vpblendw` could interleave halves of dwords from two different input registers.)
Even with AVX-512, `vpmovdb` only has one input register, vs. `vpack...` instructions that produce a full-width output with elements from two full-width inputs. (In 16-byte lanes separately, so you'd still need a `vpermd` at the end to put your 4-byte chunks that came from lanes of 4 floats into the right order.)
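To illustrate that single-input limitation (a hedged sketch of mine, assuming AVX-512F; since AVX-512 compares produce masks rather than vectors, it first re-expands the mask into 0 / -1 dwords with a zero-masked broadcast):

```c
#include <immintrin.h>
#include <stdint.h>

// vpmovdb truncates 16 dwords to 16 bytes, but reads only one input register,
// so 64 results would take 4 of these plus combining, vs. 2-input vpack... chains.
void cmp16(int8_t *result, __m512 a, __m512 b)
{
    __mmask16 k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OQ);
    __m512i dwords = _mm512_maskz_set1_epi32(k, -1);   // mask -> 0 / -1 dword vector
    // vpmovdb is truncation: -1 (0xFFFFFFFF) becomes 0xFF, 0 stays 0
    _mm_storeu_si128((__m128i*)result, _mm512_cvtepi32_epi8(dwords));
}
```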
Of course, AVX-512 with 512-bit vector width can only compare-into-mask. This is fantastic for storing a bitmap: just `vcmpps k1, zmm0, [rsi]` / `kmov [rdi], k1`. But for storing a `bool` array, you'd probably want `kunpck` to concatenate compare results: 2x `kunpckwd` to combine 16-bit masks into 32-bit masks, then `kunpckdq` to make a single 64-bit mask from 64 float compare results. Then use that with a zero-masked `vmovdqu8 zmm0{k1}{z}, zmm1` and store that to memory. (A memory destination only allows merge-masking, not zero-masking.)
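A minimal AVX-512BW sketch of that `kunpck` chain, assuming 64 contiguous floats and a less-than predicate; the intrinsic argument order (first operand lands in the high half) is my reading of `kunpckwd`/`kunpckdq`:

```c
#include <immintrin.h>
#include <stdint.h>

// 64 float compare results -> 64-byte 0 / 1 bool array
void cmp64(uint8_t *result, const float *a, const float *b)
{
    __mmask16 m0 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a),      _mm512_loadu_ps(b),      _CMP_LT_OQ);
    __mmask16 m1 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 16), _mm512_loadu_ps(b + 16), _CMP_LT_OQ);
    __mmask16 m2 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 32), _mm512_loadu_ps(b + 32), _CMP_LT_OQ);
    __mmask16 m3 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 48), _mm512_loadu_ps(b + 48), _CMP_LT_OQ);

    __mmask32 lo = _mm512_kunpackw(m1, m0);   // kunpckwd: m0 -> bits 0..15, m1 -> bits 16..31
    __mmask32 hi = _mm512_kunpackw(m3, m2);
    __mmask64 k  = _mm512_kunpackd(hi, lo);   // kunpckdq: one 64-bit mask

    // Zero-masked vmovdqu8 into a register, then a full-width store
    __m512i bools = _mm512_maskz_mov_epi8(k, _mm512_set1_epi8(1));
    _mm512_storeu_si512(result, bools);
    // For the bitmap option instead: _store_mask64((__mmask64*)bitmap, k);  (kmov to memory)
}
```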
AVX-512 could still potentially be useful with only 256-bit registers (to avoid turbo penalties and so on), although `vpermt2w` / `vpermt2b` aren't single-uop even on Ice Lake.
## Compiler auto-vectorization of scalar source is sub-optimal
Compilers do auto-vectorize this (https://godbolt.org/z/3o58W919Y), but do a rather poor job of it. It's still very likely faster than scalar, especially with AVX2 available.
clang packs each vector down separately, for 4-byte stores, but the individual packing is decently efficient. With AVX2, it jumps through some extra hoops: `vextractf128` and then packing 8 dwords down to 8 bytes, before shuffling that together with another 8 bools and then `vinserti128` with another 16. So it's eventually storing 32 bytes of bools at a time, but takes a lot of shuffles to get there.
Per YMM store, clang `-march=haswell` does 4 `vextract`, 8 `pack`s, 2 `vinsert`, 1 `vpunpcklqdq`, and 1 `vpermq`, for a total of 16 shuffles: one shuffle per 2 bytes of output. My version does 3 shuffles per 16 bytes of output, or with AVX2, 4 per 32 bytes if you widen everything to `__m256i` / `_mm256...` and add a final shuffle to fix up for lane-crossing. (Plus 4x `vcmpps`, and 1x `vpabsb` to flip -1 to +1.)
GCC uses unsigned pack instructions like `packusdw` as the first step, doing `pand` instructions on each input to each pack instruction, and also an unnecessary one between the two steps, because I think it's emulating unsigned->unsigned packing in terms of the SSE4.1 `packusdw` / SSE2 `packuswb` signed->unsigned packs. Even if it's stuck on using unsigned packing, it would be a lot less bad to just mask (or `pabsd`) the value to 0 or 1 once, so no further masking is needed before or after the 2 packing steps.
(SSE2 `packssdw` preserves -1 or 0 just fine, without even needing to saturate. It seems GCC isn't keeping track of the limited value-range of compare results, so it doesn't realize it can let the pack instructions just work.)
And without SSE4.1, GCC does even worse: with only SSSE3 it uses some `pshufb` and `por` instructions to feed 2x `pand` / SSE2 `packuswb`.
With word->byte pack instructions all treating their inputs as signed, it makes some sense that Intel omitted a `packusdw` 32 -> 16-bit pack until SSE4.1: the normal first step for packing dwords to bytes is `packssdw`, even if you eventually want to clamp signed integers to a 0..255 range.