SIMD, SSE, AVX - mask 8 floats by unsigned char?

Question

How can I mask 8 floats in my __m256 variable via bits in my unsigned char variable? (their values are not known during compilation)

__m256 flts = _mm256_set1_ps(5.0f);
unsigned char mask = 0b10010111; // just an example; can be any value at runtime

Desired output would have flts contain 5, 0, 0, 5, 0, 5, 5, 5

Is there an efficient instruction for this in the Intel Intrinsics Guide?

The processor only supports instructions up to AVX (not AVX2 or beyond).

To mask at runtime you would need to load the values into register then perform a masking operation on them (such as an 'AND' operation) with a mask you make. You won't be able to modify your actual instructions executed by the cpu at runtime. — Chase R Lewis, Oct 09 '19 at 22:09
https://stackoverflow.com/questions/36488675/is-there-an-inverse-instruction-to-the-movemask-instruction-in-intel-avx2/36491672 — Mysticial, Oct 09 '19 at 22:10
If your mask is already stored as an `__m256`, you can use the `blendv` intrinsics with a zero vector. Otherwise, you would probably have to build the mask by hand, as Walter's answer shows. — Tobias Ribizel, Oct 09 '19 at 22:33
Updated my answer on the linked duplicate with an intrinsics version. It's *much* more efficient than the `setr` answer with scalar bit tests given here. Obviously use it to get a mask and apply with `_mm256_and_ps` because `0.0 & anything` is `0.0`, i.e. normal SIMD masking. — Peter Cordes, Oct 10 '19 at 03:47
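As a rough illustration of what the comments above describe, here is a minimal sketch of applying a mask that already lives in an `__m256`, once with `_mm256_and_ps` and once with `_mm256_blendv_ps` against a zero vector. The function names, the `keep` array (nonzero means "keep this element") and the `AVX_FN` macro are assumptions made for this example only; they are not from the question or the linked answer.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: lets this sketch build on GCC/Clang without -mavx.
// With -mavx (or /arch:AVX on MSVC) it is not needed.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Build an all-ones / all-zeros lane mask from keep[i] != 0.
AVX_FN static __m256 lane_mask_from_floats(const float* keep) {
    return _mm256_cmp_ps(_mm256_loadu_ps(keep), _mm256_setzero_ps(), _CMP_NEQ_OQ);
}

// AND: an all-zero lane in the mask clears every bit of that float, giving +0.0f.
AVX_FN void mask_with_and(const float* in, const float* keep, float* out) {
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_and_ps(v, lane_mask_from_floats(keep)));
}

// blendv: picks the second operand where the mask's sign bit is set,
// and the first operand (here zero) elsewhere.
AVX_FN void mask_with_blendv(const float* in, const float* keep, float* out) {
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_blendv_ps(_mm256_setzero_ps(), v,
                                           lane_mask_from_floats(keep)));
}
```

Both variants should give the same result for a mask whose lanes are all-ones or all-zeros; the AND form is usually the cheaper one, since `blendv` only adds value when the "off" lanes need something other than zero.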

Walter · Answer 1 · 2019-10-09T22:44:15.163

0

If you had AVX512F and AVX512VL you could use this:

auto input    = _mm256_set1_ps(5.0f);
__mmask8 mask = 0b10101010;
auto masked   = _mm256_maskz_mov_ps(mask,input);

Otherwise, you must use bitwise operations, for which you first need a way to 'unpack' the 8 bits into the eight 32-bit fields of an __m256, for example:

static constexpr int32_t all_mask = -1;  // all 32 bits set
auto tmp      = _mm256_setr_epi32(mask&1 ? all_mask:0, mask&2  ? all_mask:0,
                                  mask&4 ? all_mask:0, mask&8  ? all_mask:0,
                                  mask&16? all_mask:0, mask&32 ? all_mask:0,
                                  mask&64? all_mask:0, mask&128? all_mask:0);
// tmp is an __m256i; reinterpret it as __m256 (no instruction is emitted) before the AND
auto masked   = _mm256_and_ps(_mm256_castsi256_ps(tmp),input);

(I may have confused _mm256_setr_epi32 and _mm256_set_epi32.) There is presumably a faster way to unpack the mask, see this answer.
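The `setr`/`set` ordering can be checked by storing both vectors to memory and looking at the elements. The sketch below does that; `store_setr`, `store_set` and the `AVX_FN` macro are hypothetical names introduced only for this illustration.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: lets this sketch build on GCC/Clang without -mavx.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// setr ("reversed"): arguments are given in memory order, first argument = element 0.
AVX_FN void store_setr(int* out) {
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                        _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
}

// set: arguments are given from the highest element down, last argument = element 0.
AVX_FN void store_set(int* out) {
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                        _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7));
}
```

After `store_setr(a)` the array holds 0..7 in order, while `store_set(b)` leaves 7..0. So with `_mm256_setr_epi32` as used above, bit 0 of the mask controls element 0, the lowest-addressed float.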

In other words, in this case it's perhaps better never to use an 8-bit integer as the mask, and to use an __m256 or __m256i directly.

edited Oct 09 '19 at 22:44

answered Oct 09 '19 at 22:31


3

You can do *much* better for the inverse of movemask using SIMD instead of `_mm_set`/`setr`. Will google when I have time. – Peter Cordes Oct 09 '19 at 22:34
2

Your version is pretty much a performance disaster with GCC9.2 (everything scalar and then insert each mask separately into a vector), and not much better with clang9.0. https://godbolt.org/z/m7elLc (Although clang does come up with some interesting stuff like doing one half of the bitmap with SIMD integer multiply by large single-bit constants, and using `vpsrad` for one half). Still, these are *much* worse than using a 256-bit AND + FP compare to check for matching FP bit-patterns. – Peter Cordes Oct 10 '19 at 04:08
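For reference, the "256-bit AND + FP compare" idea mentioned in the comment above could look roughly like the following with plain AVX. This is a sketch reconstructed from that description, not a copy of the linked answer: the names `bits_to_lane_mask`, `mask_floats` and the `AVX_FN` macro are assumptions. The bits are ORed into the significand of 1.0f so that every compared value is a normal float, which avoids relying on how subnormals are handled.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: lets this sketch build on GCC/Clang without -mavx.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Expand 8 bits into 8 lanes of all-ones / all-zeros; bit i controls element i.
AVX_FN static __m256 bits_to_lane_mask(unsigned char bits) {
    const __m256 exponent = _mm256_set1_ps(1.0f);  // bit pattern 0x3f800000
    // Lane i holds 1.0f with significand bit i additionally set.
    const __m256 bit_select = _mm256_castsi256_ps(_mm256_setr_epi32(
        0x3f800000 + (1 << 0), 0x3f800000 + (1 << 1),
        0x3f800000 + (1 << 2), 0x3f800000 + (1 << 3),
        0x3f800000 + (1 << 4), 0x3f800000 + (1 << 5),
        0x3f800000 + (1 << 6), 0x3f800000 + (1 << 7)));
    __m256 bcast    = _mm256_castsi256_ps(_mm256_set1_epi32(bits));  // bits in every lane
    __m256 ored     = _mm256_or_ps(bcast, exponent);     // 1.0f | bits: always a normal float
    __m256 isolated = _mm256_and_ps(ored, bit_select);   // keep the exponent plus bit i only
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);  // true where bit i was set
}

// Zero the floats whose mask bit is clear.
AVX_FN void mask_floats(const float* in, unsigned char bits, float* out) {
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_and_ps(v, bits_to_lane_mask(bits)));
}
```

Note the element order: with this layout, `0b10010111` applied to eight 5.0f values stores 5, 5, 5, 0, 5, 0, 0, 5 starting at element 0, which is the question's list read from the least significant bit. If the most significant bit should control element 0 instead, the constants in `bit_select` would need to be listed in the opposite order.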
