How can I mask 8 floats in my __m256 variable via bits in my unsigned char variable? (their values are not known during compilation)

__m256 flts = _mm256_set1_ps(5.0f);
unsigned char mask = 0b10010111; // just an example; can be any value at runtime

Desired output would have flts contain 5, 0, 0, 5, 0, 5, 5, 5

Is there an efficient instruction for this in the Intel Intrinsics Guide?

Processor only supports instructions up to AVX (but not AVX2 or beyond)

Kari
  • To mask at runtime you would need to load the values into register then perform a masking operation on them (such as an 'AND' operation) with a mask you make. You won't be able to modify your actual instructions executed by the cpu at runtime. – Chase R Lewis Oct 09 '19 at 22:09
  • https://stackoverflow.com/questions/36488675/is-there-an-inverse-instruction-to-the-movemask-instruction-in-intel-avx2/36491672 – Mysticial Oct 09 '19 at 22:10
  • Do you have __AVX512F__ and __AVX512VL__ ? – Walter Oct 09 '19 at 22:17
  • @walter, nope, only AVX with `m256` – Kari Oct 09 '19 at 22:20
  • If your mask is already stored as a `m256`, you can use the `blendv` intrinsics with a zero vector. Otherwise, you would probably have to build the mask by hand, as Walter's answer shows – Tobias Ribizel Oct 09 '19 at 22:33
  • Updated my answer on the linked duplicate with an intrinsics version. It's *much* more efficient than the `setr` answer with scalar bit tests given here. Obviously use it to get a mask and apply with `_mm256_and_ps` because `0.0 & anything` is `0.0`, i.e. normal SIMD masking. – Peter Cordes Oct 10 '19 at 03:47

1 Answer

If you had AVX512F and AVX512VL you could use this:

auto input    = _mm256_set1_ps(5.0f);
__mmask8 mask = 0b10101010;
auto masked   = _mm256_maskz_mov_ps(mask,input);

Otherwise, you must use a bitwise AND, for which you first need a way to 'unpack' the 8 bits into the eight 32-bit fields of an __m256, for example:

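static constexpr int32_t all_mask = -1; // all 32 bits set
// setr takes its arguments lowest element first, so bit 0 controls element 0
auto tmp      = _mm256_setr_epi32(mask&1 ? all_mask:0, mask&2  ? all_mask:0,
                                  mask&4 ? all_mask:0, mask&8  ? all_mask:0,
                                  mask&16? all_mask:0, mask&32 ? all_mask:0,
                                  mask&64? all_mask:0, mask&128? all_mask:0);
// tmp is an __m256i; _mm256_and_ps needs __m256, so reinterpret it (the cast is free)
auto masked   = _mm256_and_ps(_mm256_castsi256_ps(tmp),input);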

(_mm256_setr_epi32 takes its arguments lowest element first, so bit 0 of the mask controls element 0; _mm256_set_epi32 takes them in the opposite order.) There is presumably a faster way to unpack the mask; see this answer.
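One way such a faster unpack could look with AVX1 only is sketched below, following the "256-bit AND + FP compare" idea from the comments: broadcast the mask to all lanes, isolate one bit per lane, and compare against the per-lane bit pattern. The names `expand_mask8_avx1`, `mask8_floats` and `TARGET_AVX` are made up for this illustration, and ORing in the bit pattern of 1.0f is an assumption made here so that the compared values are normal floats rather than denormals.

```cpp
#include <immintrin.h>

// Lets the functions below compile on GCC/Clang even without -mavx;
// unnecessary (but harmless) if the whole file is built with -mavx.
#if defined(__GNUC__) || defined(__clang__)
#  define TARGET_AVX __attribute__((target("avx")))
#else
#  define TARGET_AVX
#endif

// Expand an 8-bit mask into a vector mask: lane i becomes all-ones if bit i
// of `mask` is set, all-zeros otherwise. Uses no 256-bit integer arithmetic.
TARGET_AVX static inline __m256 expand_mask8_avx1(unsigned char mask)
{
    // 0x3f800000 is the bit pattern of 1.0f. Keeping the exponent bits set
    // means the FP compare below never sees denormal inputs.
    const __m256 one = _mm256_set1_ps(1.0f);
    const __m256 bit_select = _mm256_castsi256_ps(_mm256_setr_epi32(
        0x3f800000 | (1 << 0), 0x3f800000 | (1 << 1),
        0x3f800000 | (1 << 2), 0x3f800000 | (1 << 3),
        0x3f800000 | (1 << 4), 0x3f800000 | (1 << 5),
        0x3f800000 | (1 << 6), 0x3f800000 | (1 << 7)));

    // Same 8 mask bits in the low significand bits of every lane.
    __m256 bcast    = _mm256_castsi256_ps(_mm256_set1_epi32(mask));
    // Lane i keeps the exponent of 1.0f plus (only) bit i of the mask.
    __m256 isolated = _mm256_and_ps(_mm256_or_ps(bcast, one), bit_select);
    // Equal to bit_select exactly when bit i was set.
    return _mm256_cmp_ps(isolated, bit_select, _CMP_EQ_OQ);
}

// Hypothetical wrapper: zero the floats whose mask bit is clear.
TARGET_AVX static inline void mask8_floats(const float* in, unsigned char mask,
                                           float* out)
{
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(out, _mm256_and_ps(v, expand_mask8_avx1(mask)));
}
```

As with the setr version, bit 0 of the mask controls element 0, so with the mask 0b10010111 elements 0, 1, 2, 4 and 7 keep their value and the rest become 0.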

In other words, in this case it's perhaps better never to use an 8-bit integer as the mask, but to use an __m256 or __m256i directly.
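As a rough illustration of keeping the mask in vector form: an AVX compare already produces an __m256 mask that can be applied with a single AND, and an 8-bit summary can still be extracted with _mm256_movemask_ps if it is needed elsewhere. The function `keep_greater_than`, its threshold condition and the `TARGET_AVX` macro are hypothetical, chosen only for this sketch.

```cpp
#include <immintrin.h>

// Lets the function below compile on GCC/Clang even without -mavx;
// unnecessary (but harmless) if the whole file is built with -mavx.
#if defined(__GNUC__) || defined(__clang__)
#  define TARGET_AVX __attribute__((target("avx")))
#else
#  define TARGET_AVX
#endif

// Hypothetical example: keep the elements greater than `threshold`, zero the
// others, and return the 8-bit summary of which elements were kept.
TARGET_AVX static inline int keep_greater_than(const float* in, float threshold,
                                               float* out)
{
    __m256 v    = _mm256_loadu_ps(in);
    // Each lane is all-ones where v > threshold, all-zeros otherwise.
    __m256 keep = _mm256_cmp_ps(v, _mm256_set1_ps(threshold), _CMP_GT_OQ);
    // The mask never leaves vector registers before it is applied.
    _mm256_storeu_ps(out, _mm256_and_ps(v, keep));
    // Bit i of the result is the sign bit of lane i of the mask.
    return _mm256_movemask_ps(keep);
}
```

Here the mask is created and consumed as a vector, so no unpacking step is needed at all.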

Walter
  • You can do *much* better for the inverse of movemask using SIMD instead of `_mm_set`/`setr`. Will google when I have time. – Peter Cordes Oct 09 '19 at 22:34
  • Your version is pretty much a performance disaster with GCC9.2 (everything scalar and then insert each mask separately into a vector), and not much better with clang9.0. https://godbolt.org/z/m7elLc (Although clang does come up with some interesting stuff like doing one half of the bitmap with SIMD integer multiply by large single-bit constants, and using `vpsrad` for one half). Still, these are *much* worse than using a 256-bit AND + FP compare to check for matching FP bit-patterns. – Peter Cordes Oct 10 '19 at 04:08