Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
10101010 is bitmask
value is 121
it will set positions 0,2,4,6 with value 121
Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
10101010 is bitmask
value is 121
it will set positions 0,2,4,6 with value 121
With AVX512, yes. Masked stores are a first-class operation in AVX512.
Use the bitmask as an AVX512 mask for a vector store to an array, using _mm512_mask_storeu_epi8 (void* mem_addr, __mmask64 k, __m512i a)
vmovdqu8
. (AVX512BW. With AVX512F, you can only use 32 or 64-bit element size.)
#include <immintrin.h>
#include <stdint.h>
void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
__m512i broadcastv = _mm512_set1_epi8(value);
// integer types are implicitly convertible to/from __mmask types
// the compiler emits the KMOV instruction for you.
_mm512_mask_storeu_epi8 (array, bitmask, broadcastv);
}
This compiles (with gcc7.3 -O3 -march=skylake-avx512
) to:
vpbroadcastb zmm0, edx
kmovq k1, rsi
vmovdqu8 ZMMWORD PTR [rdi]{k1}, zmm0
vzeroupper
ret
If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a constant from the mask and store that, or create a 0 / -1 vector using AVX512BW or DQ __m512i _mm512_movm_epi8(__mmask64 )
. Other element sizes are available. But using a masked store makes it possible to safely use it when the array size isn't a multiple of the vector width, because the unmodified elements aren't read / rewritten or anything; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)
Without AVX512, you still asked for "an intrinsic" (singular).
There's pdep
, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101);
to unpack each bit in mask
to a byte. This gives you 8 bytes in a uint64_t
. In C, if you use a union
between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy
the uint64_t
into a permanent array.)
But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb
, and use that to blend. (See the without pdep
section below)
In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep
. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.
Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.
If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep
, but then use [v]pmovzx
to expand from bytes to dwords or whatever in a vector. e.g.
// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101); // expand bits to bytes
__m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
return _mm256_cvtepu8_epi32(byte_vec);
}
If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR with the previous contents instead of assigning / storing.
This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t
into a char array, you can (and should) just use memcpy
(to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*
). This will compile to a single 8-byte store with modern compilers.
But to OR them in, you'd either have to write a loop over the bytes of the uint64_t
, or cast your char array to uint64_t*
. This usually works fine, because char*
can alias anything so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t*
can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Assigning a value other than 0 / 1
Use a multiply by 0xFF
to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t
that has your value broadcasted to all byte positions.
If you want to leave element unmodified instead of setting them to zero or value=121
, you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements. Load the old contents, vpblendvb
with set1(121)
, using the byte-mask as a control vector.
vpblendvb
only uses the high bit of each byte, so your pdep
constant can be 0x8080808080808080
to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF
to get an AND mask).
If your elements are dword or larger, you could use _mm256_maskstore_epi32
. (Use pmovsx
instead of zx to copy the sign bit when expanding the mask from bytes to dwords). This can be a perf win over a variable-blend + always read / re-write. Is it possible to use SIMD instruction for replace?.
pdep
pdep
is very slow on Ryzen, and even on Intel it's maybe not the best choice.
The alternative is to turn your bitmask into a vector mask:
is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right bit of the bitmap in in the corresponding byte), and use a SIMD AND to mask off the appropriate bit for that byte. Then use pcmpeqb/w/d
against the AND-mask to find the elements that had their bit set.
You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.
Use the compare-mask to blend on your value
, e.g. with _mm_blendv_epi8
or the 256bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb
to send bytes of it to the right elements.
It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.