_mm_maskmoveu_si128 is usually bad for performance. SSE2 maskmovdqu has NT store semantics, so a mask that isn't all-ones will produce a non-full-line NT store.
On KNL, where you don't have AVX512BW, a masked store of 16 bytes can be done with AVX-512F vpmovdb, aka _mm512_mask_cvtepi32_storeu_epi8, truncating a vector of 32-bit integers down to effectively a __m128i, so per-element masking gives you byte-masking granularity for a cacheable store.
Widening conversions like _mm512_cvtepu8_epi32 (vpmovzxbd) are also available, if it's more convenient to have your data as a __m128i of 16x 8-bit integers instead of using an __m512i of 16x 32-bit integers in the first place.
#include <immintrin.h>
#include <stdint.h>
static inline
void masked_16byte_store(void *dst, __mmask16 msk, __m128i src)
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)   // the 128-bit byte-masked store needs BW+VL
_mm_mask_storeu_epi8(dst, msk, src);
#else
__m512i v512 = _mm512_cvtepu8_epi32(src); // ideally work with 32-bit elements in the first place to skip this.
_mm512_mask_cvtepi32_storeu_epi8(dst, msk, v512);
// fun fact: GCC optimizes this to vmovdqu8 if available, but clang, ICC, and MSVC don't
#endif
}
void store_first_n_bytes(void *dst, int len, __m128i src)
{
// len=0 => mask=0000
// len=1 => mask=0001
// len=16 => mask=FFFF
//__mmask16 msk = _cvtu32_mask16(0xFFFFu >> (16-len));
//__mmask16 msk = _cvtu32_mask16(_rotl(0xFFFFu<<16, len)); // rotate n set bits into the bottom. But variable-count rotates are 3 or 2 uops on Intel
__mmask16 msk = _cvtu32_mask16((1U<<len) - 1); // int is 32 bits so even 1<<16 doesn't overflow
masked_16byte_store(dst, msk, src);
}
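Hypothetical usage example (not from the question; main, dst, and the string literal are made-up for illustration), just to show the intended len = 0..16 behaviour of store_first_n_bytes above:
#include <stdio.h>
#include <string.h>
// assumes <immintrin.h> and the store_first_n_bytes() definition above
int main(void)
{
    char dst[16];
    memset(dst, '.', sizeof(dst));                // visible background bytes
    __m128i src = _mm_loadu_si128((const __m128i*)"ABCDEFGHIJKLMNOP");
    store_first_n_bytes(dst, 5, src);             // only the first 5 bytes of dst are written
    printf("%.16s\n", dst);                       // prints "ABCDE..........."
    return 0;
}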
In your case, you don't need a lookup table for length->mask either, just use a bithack. It compiles to a couple of instructions; possibly there'd be some advantage to kmov from a lookup table (footnote 1), if it always hits in cache and the front-end is a bottleneck (which Agner Fog says is common on KNL). _cvtu32_mask16 to convert from unsigned to __mmask16 is not necessary in practice; on the major implementations __mmask16 is just unsigned short, so implicit conversion works.
This compiles nicely with GCC, clang, and ICC (Godbolt), including for Skylake-avx512 (with AVX512BW):
# GCC 12 -O3 -march=knl
store_first_n_bytes(void*, int, long long __vector(2)):
vpmovzxbd zmm0, xmm0
mov eax, -1
shlx eax, eax, esi
not eax
kmovw k1, eax
vpmovdb XMMWORD PTR [rdi]{k1}, zmm0
ret
# ICC2021 -O3 -march=skylake-avx512 compiles the source more literally
store_first_n_bytes(void*, int, __m128i):
mov eax, 1 #24.21
shlx edx, eax, esi #24.21
dec edx #24.21
kmovw k1, edx #8.5
vmovdqu8 XMMWORD PTR [rdi]{k1}, xmm0 #8.5
ret #26.1
Or with -march=knl, ICC uses add eax,-1, and of course vpmovzxbd / vpmovdb. It uses ZMM16 as a temporary, perhaps because on non-KNL CPUs that would avoid needing a vzeroupper. KNL doesn't ever need vzeroupper, and should avoid it because it's slow.
MSVC -O2 -arch:AVX512 uses basically the same asm as ICC. Using that compiler option defines __AVX512BW__, so I wouldn't recommend using MSVC to build for KNL.
Even better would be xor eax,eax / bts eax, esi (create 1<<len) / dec eax, but compilers often fail to use BTS, which is a single uop with a register destination on Intel CPUs (https://uops.info/). The xor-zeroing is as cheap as a NOP on Sandybridge-family CPUs, and still fewer instruction bytes on others. Smaller machine-code size should help KNL's weak front-end.
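For what it's worth, a hypothetical GNU C inline-asm sketch (AT&T syntax, GCC/clang only; mask_from_len_bts is a made-up name) of that xor / bts / dec sequence, since compilers won't emit it on their own:
static inline unsigned mask_from_len_bts(unsigned len)
{
    unsigned m;
    asm ("xor %0, %0\n\t"     // xor-zeroing: as cheap as a NOP on Sandybridge-family
         "bts %1, %0\n\t"     // set bit 'len': m = 1 << len (single uop, register destination)
         "dec %0"             // m = (1 << len) - 1
         : "=&r"(m)           // early-clobber: written before 'len' is read
         : "r"(len)
         : "cc");
    return m;
}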
Footnote 1: kmov k, mem costs on Xeon Phi vs. mainstream CPUs
According to Agner Fog's instruction tables, Xeon Phi (Knights Landing) runs kmovw k,m as a single uop with 2/clock throughput. So a table lookup actually can save uops vs. a bithack.
But Skylake-server, Ice Lake, and Alder Lake (a proxy for Sapphire Rapids) all run it as 3 uops for the front-end, 2 for the back-end (on ports p23 + p5): basically like a mov to a temporary general-purpose register plus a kmov k, r (https://uops.info/).
So a table lookup costs at least 3 front-end uops on mainstream CPUs, including a port 5 uop (like the kmovw k1, edx). And that's if the table is in static storage in a non-PIE executable and len is already zero-extended to 64-bit, so you can use a non-RIP-relative addressing mode like kmovw k1, [LUT + rsi*2]. If you need a RIP-relative LEA in there for position-independent code, you're at break-even with the bithack for front-end uops. (Most Linux distros configure GCC with -fPIE -pie on by default, so you'd get lea rcx, [rip + LUT] / kmovw k1, [rcx + rsi*2].)
Without AVX-512, cacheable masked stores only come in 32-bit or 64-bit element sizes: vmaskmovps/pd and vpmaskmovd/q, i.e. _mm_maskstore_epi32 and so on. They're rather slow on AMD, at least on Zen 3 and earlier (up to 42 uops, 12-cycle throughput for a 256-bit store with 8x 32-bit elements), since masked stores require special hardware support even when fault suppression isn't required. Loads are easier; the hardware can just do a normal load and zero some elements after the fact.
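For the record, a minimal AVX2 sketch of that dword-granularity masked store (store_first_n_dwords_avx2 is a made-up name); it only handles lengths that are multiples of 4 bytes, so it's not a drop-in replacement for the byte-granular AVX-512 version above:
#include <immintrin.h>
static inline void store_first_n_dwords_avx2(void *dst, int n, __m128i src)
{
    // Per-element mask: element i stores when i < n (n = 0..4).
    // vpmaskmovd looks only at the high bit of each 32-bit mask element.
    __m128i idx  = _mm_setr_epi32(0, 1, 2, 3);
    __m128i mask = _mm_cmpgt_epi32(_mm_set1_epi32(n), idx);
    _mm_maskstore_epi32((int*)dst, mask, src);
}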