I'm trying to figure out how to use masked loads and stores for the last few elements to be processed. My use case involves converting a packed 10 bit data stream to 16 bit which means loading 5 bytes before storing 4 shorts. This results in different masks of different types.
The main loop itself is not a problem. But at the end I'm left with up to 19 bytes input / 15 shorts output which I thought I could process in up to two loop iterations using the 128 bit vectors. Here is the outline of the code.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
void convert(uint16_t* out, ptrdiff_t n, const uint8_t* in)
{
uint16_t* const out_end = out + n;
for(uint16_t* out32_end = out + (n & -32); out < out32_end; in += 40, out += 32) {
/*
* insert main loop here using ZMM vectors
*/
}
if(out_end - out >= 16) {
/*
* insert half-sized iteration here using YMM vectors
*/
in += 20;
out += 16;
}
// up to 19 byte input remaining, up to 15 shorts output
const unsigned out_remain = out_end - out;
const unsigned in_remain = (out_remain * 10 + 7) / 8;
unsigned in_mask = (1 << in_remain) - 1;
unsigned out_mask = (1 << out_remain) - 1;
while(out_mask) {
__mmask16 load_mask = _cvtu32_mask16(in_mask);
__m128i packed = _mm_maskz_loadu_epi8(load_mask, in);
/* insert computation here. No masks required */
__mmask8 store_mask = _cvtu32_mask8(out_mask);
_mm_mask_storeu_epi16(out, store_mask, packed);
in += 10;
out += 8;
in_mask >>= 10;
out_mask >>= 8;
}
}
(Compile with -O3 -mavx2 -mavx512f -mavx512bw -mavx512vl -mavx512dq
)
My idea was to create a bit mask from the number of remaining elements (since I know it fits comfortably in an integer / mask register), then shift values out of the mask as they are processed.
I have two issues with this approach:
- I'm re-setting the masks from GP registers each iteration instead of using the
kshift
family of instructions _cvtu32_mask8
(kmovb
) is the only instruction in this code that requires AVX512DQ. Limiting the number of suitable hardware platforms just for that seems weird
What I'm wondering about:
Can I cast mmask32 to mmask16 and mmask8?
If I can, I could set it once from the GP register, then shift it in its own register. Like this:
__mmask32 load_mask = _cvtu32_mask32(in_mask);
__mmask32 store_mask = _cvtu32_mask32(out_mask);
while(out < out_end) {
__m128i packed = _mm_maskz_loadu_epi8((__mmask16) load_mask, in);
/* insert computation here. No masks required */
_mm_mask_storeu_epi16(out, (__mmask8) store_mask, packed);
load_mask = _kshiftri_mask32(load_mask, 10);
store_mask = _kshiftri_mask32(store_mask, 8);
in += 10;
out += 8;
}
GCC seems to be fine with this pattern. But Clang and MSVC create worse code, moving the mask in and out of GP registers without any apparent reason.