
Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?

10101010 is bitmask

value is 121

it will set positions 0,2,4,6 with value 121

Peter Cordes
user855

1 Answer


With AVX512, yes. Masked stores are a first-class operation in AVX512.

Use the bitmask as an AVX512 mask for a vector store to the array, with _mm512_mask_storeu_epi8(void* mem_addr, __mmask64 k, __m512i a), which compiles to vmovdqu8. (That intrinsic is AVX512BW; with only AVX512F, the smallest maskable element size is 32 bits.)

#include <immintrin.h>
#include <stdint.h>

void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
    __m512i broadcastv = _mm512_set1_epi8(value);
    // integer types are implicitly convertible to/from __mmask types
    // the compiler emits the KMOV instruction for you.
    _mm512_mask_storeu_epi8(array, bitmask, broadcastv);
}

This compiles (with gcc7.3 -O3 -march=skylake-avx512) to:

    vpbroadcastb    zmm0, edx
    kmovq   k1, rsi
    vmovdqu8        ZMMWORD PTR [rdi]{k1}, zmm0
    vzeroupper
    ret

If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a constant from the mask and store that, or create a 0 / -1 vector using AVX512BW or DQ __m512i _mm512_movm_epi8(__mmask64 ). Other element sizes are available. But using a masked store makes it possible to safely use it when the array size isn't a multiple of the vector width, because the unmodified elements aren't read / rewritten or anything; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)


Without AVX512 there's no single-instruction masked byte store, but you did ask for "an intrinsic" (singular), and one comes close.

There's pdep, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101); to unpack each bit in mask to a byte. This gives you 8 bytes in a uint64_t. In C, if you use a union between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy the uint64_t into a permanent array.)
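As a sketch of that idea (the helper name is mine, and the portable fallback branch is only there so the logic can be checked without BMI2; both paths assume a little-endian machine, which is a given for x86):

```c
#include <stdint.h>
#include <string.h>
#if defined(__BMI2__)
#include <immintrin.h>   // _pdep_u64
#endif

// Expand the low 8 bits of `mask` into eight 0 / 1 bytes and copy them
// into `out`.  On little-endian, byte i of the result is bit i of the mask.
static void unpack_bits_to_bytes(uint8_t out[8], unsigned mask) {
#if defined(__BMI2__)
    uint64_t bytes = _pdep_u64(mask, 0x0101010101010101ULL);
#else
    // Portable fallback for illustration; the pdep version is the point.
    uint64_t bytes = 0;
    for (int i = 0; i < 8; i++)
        bytes |= (uint64_t)((mask >> i) & 1) << (8 * i);
#endif
    memcpy(out, &bytes, 8);   // compiles to a single 8-byte store
}
```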

But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb, and use that to blend. (See the "Without pdep" section below.)


In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.

Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.


If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep, but then use [v]pmovzx to expand from bytes to dwords or whatever in a vector. e.g.

// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
    uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101);  // expand bits to bytes
    __m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
    return _mm256_cvtepu8_epi32(byte_vec);
}

If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR the expanded bytes with the previous contents instead of assigning / storing them.

This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t into a char array, you can (and should) just use memcpy (to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*). This will compile to a single 8-byte store with modern compilers.

But to OR them in, you'd either have to write a loop over the bytes of the uint64_t, or cast your char array to uint64_t*. This usually works fine, because char* can alias anything so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t* can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
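A minimal sketch of the byte-loop version (helper name is mine): because it only does byte accesses, it sidesteps the misaligned-uint64_t* problem entirely, and compilers can usually still turn it into a single 8-byte load / OR / store.

```c
#include <stdint.h>

// OR each byte of `mask_bytes` into the corresponding array element.
// Byte accesses avoid any alignment or strict-aliasing issues.
static void or_bytes_into_array(uint8_t *dst, uint64_t mask_bytes) {
    for (int i = 0; i < 8; i++)
        dst[i] |= (uint8_t)(mask_bytes >> (8 * i));
}
```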


Assigning a value other than 0 / 1

Use a multiply by 0xFF to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t that has your value broadcasted to all byte positions.
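In scalar code that trick looks like this (a sketch with a name of my own; it assumes the 0 / 1 byte mask has already been produced, e.g. by pdep). The multiply by 0xFF can't carry between bytes because each byte of the input is only 0 or 1:

```c
#include <stdint.h>

// Given eight 0 / 1 bytes packed in `ones`, build a 0x00 / 0xFF byte mask
// and AND it with `value` broadcast to every byte position.
static uint64_t select_value_bytes(uint64_t ones, uint8_t value) {
    uint64_t mask = ones * 0xFF;                      // 0/1 bytes -> 0x00/0xFF bytes
    uint64_t broadcast = 0x0101010101010101ULL * value;
    return mask & broadcast;
}
```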

If you want to leave elements unmodified instead of setting them to zero or value=121, you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements. Load the old contents, vpblendvb with set1(121), using the byte-mask as a control vector.

vpblendvb only uses the high bit of each byte, so your pdep constant can be 0x8080808080808080 to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF to get an AND mask).

If your elements are dword or larger, you could use _mm256_maskstore_epi32. (Use pmovsx instead of zx to copy the sign bit when expanding the mask from bytes to dwords). This can be a perf win over a variable-blend + always read / re-write. Is it possible to use SIMD instruction for replace?.


Without pdep

pdep is very slow on Ryzen, and even on Intel it's maybe not the best choice.

The alternative is to turn your bitmask into a vector mask: is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.

i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right byte of the bitmap is in the corresponding byte of the vector), and use a SIMD AND to isolate the appropriate bit for each byte. Then use pcmpeqb/w/d against the AND-mask constant to find the elements that had their bit set.

You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.

Use the compare-mask to blend in your value, e.g. with _mm_blendv_epi8 or the 256-bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb to send bytes of the bitmap to the right elements.
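Putting the broadcast / AND / pcmpeqb / blend steps together for 32 byte elements (a hedged sketch, function name mine: the AVX2 path only compiles when AVX2 is enabled, and the scalar branch defines the reference behaviour):

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Blend `value` into 32 byte elements of `array` wherever the corresponding
// bit of `bitmask` is set, leaving the other elements unmodified.
static void blend_value_32(uint8_t *array, uint32_t bitmask, uint8_t value) {
#if defined(__AVX2__)
    __m256i old  = _mm256_loadu_si256((const __m256i *)array);
    __m256i vals = _mm256_set1_epi8((char)value);
    // Broadcast the mask dword, then shuffle (lane-local pshufb indices)
    // so vector byte i holds mask byte i/8.
    __m256i m = _mm256_set1_epi32((int)bitmask);
    const __m256i shuf = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    m = _mm256_shuffle_epi8(m, shuf);
    // Isolate bit i%8 in byte i, then compare to find set bits.
    const __m256i bits = _mm256_setr_epi8(
        1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128,
        1,2,4,8,16,32,64,-128, 1,2,4,8,16,32,64,-128);
    __m256i sel = _mm256_cmpeq_epi8(_mm256_and_si256(m, bits), bits);
    _mm256_storeu_si256((__m256i *)array,
                        _mm256_blendv_epi8(old, vals, sel));
#else
    // Scalar reference behaviour, for checking the logic without AVX2.
    for (int i = 0; i < 32; i++)
        if ((bitmask >> i) & 1)
            array[i] = value;
#endif
}
```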

It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.

Peter Cordes
    Did you know you can use `PCLMULQDQ` 3 times to spread the bits like this? That's not very fast, but on Ryzen it's at least not as bad as `pdep` – harold Jan 31 '18 at 15:41
  • @harold: Interesting. But when the pattern is regular like this, the usual methods for the inverse of `pmovmskb` are better. I mostly mentioned `pdep` because the question was "is there an intrinsic (singular) for this?" – Peter Cordes Jan 31 '18 at 17:40
  • (Update, Zen3 and later have fast pext/pdep) – Peter Cordes Jun 10 '22 at 19:45