
I wish to perform a conversion between __mmask16 and __m128i. The method posted at https://stackoverflow.com/a/32247779/6889542 is:

/* convert 16 bit mask to __m128i control byte mask */
_mm_maskz_broadcastb_epi8((__mmask16)mask,_mm_set1_epi32(~0))

However, _mm_maskz_broadcastb_epi8 and anything similar to it are not available on KNL. The lack of AVX512BW on KNL (Xeon Phi 7210) is really becoming a headache for me.

The origin of the problem is that I wish to take advantage of

_mm_maskmoveu_si128 (__m128i a, __m128i mask, char* mem_addr)

while using

__mmask16 len2mask[] = { 0x0000, 0x0001, 0x0003, 0x0007,
                         0x000F, 0x001F, 0x003F, 0x007F,
                         0x00FF, 0x01FF, 0x03FF, 0x07FF,
                         0x0FFF, 0x1FFF, 0x3FFF, 0x7FFF,
                         0xFFFF };
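
Put together, what I would like to write is something like the sketch below (the function name is just for illustration), but _mm_maskz_broadcastb_epi8 needs AVX512BW + AVX512VL, which KNL doesn't have:

void store_len_bytes(char* mem_addr, int len, __m128i a)
{
    /* not possible on KNL: _mm_maskz_broadcastb_epi8 requires AVX512BW + AVX512VL */
    __m128i byte_mask = _mm_maskz_broadcastb_epi8(len2mask[len], _mm_set1_epi32(~0));
    _mm_maskmoveu_si128(a, byte_mask, mem_addr);
}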
  • Might be a bit late but I'd suggest to simply precompute a `__m128i len2maskvector` table like your `__mmask16 len2mask` table and directly fetch the vector you want. That saves an instruction at the cost of 224 additional bytes in cache. If you need to do it from a mask: `kmov, popcnt (or lzcnt), loada`. – Christoph Diegelmann Jun 27 '17 at 11:44
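
A sketch of that suggestion (names are mine, and this variant replaces the 17-entry __m128i table with a single 32-byte sliding-window pattern to keep the data small):

#include <immintrin.h>
#include <stdint.h>

/* 16 bytes of 0xFF followed by 16 bytes of 0x00.  Loading 16 bytes starting at
 * offset (16 - len) yields a vector whose low `len` bytes are all-ones. */
static const int8_t mask_pattern[32] = {
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

static inline __m128i len_to_byte_mask(int len)   /* 0 <= len <= 16 */
{
    return _mm_loadu_si128((const __m128i *)(mask_pattern + 16 - len));
}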

2 Answers


_mm_maskmoveu_si128 is usually bad for performance. SSE2 maskmovdqu has NT store semantics, so a mask that isn't all-ones will produce a non-full-line NT store.

On KNL, where you don't have AVX512BW, a masked store of 16 bytes can be done with AVX-512F vpmovdb, aka _mm512_mask_cvtepi32_storeu_epi8, truncating a vector of 32-bit integers down to effectively a __m128i, so per-element masking gives you byte-granularity masking with a normal cacheable store.

Widening conversions like _mm512_cvtepu8_epi32 (vpmovzxbd) are also available, if it's more convenient to have your data as a __m128i of 16x 8-bit integers instead of using an __m512i of 16x 32-bit integers in the first place.

#include <immintrin.h>
#include <stdint.h>

static inline
void masked_16byte_store(void *dst, __mmask16 msk, __m128i src)
{
#ifdef __AVX512BW__
    _mm_mask_storeu_epi8(dst, msk, src);
#else
    __m512i v512 = _mm512_cvtepu8_epi32(src);       // ideally work with 32-bit elements in the first place to skip this.
    _mm512_mask_cvtepi32_storeu_epi8(dst, msk, v512);
    // fun fact: GCC optimizes this to vmovdqu8 if available, but clang, ICC, and MSVC don't
#endif
}

void store_first_n_bytes(void *dst, int len, __m128i src)
{
   // len=0  => mask=0000
   // len=1  => mask=0001
   // len=16 => mask=FFFF

    //__mmask16 msk = _cvtu32_mask16(0xFFFFu >> (16-len));
    //__mmask16 msk = _cvtu32_mask16(_rotl(0xFFFFu<<16, len));  // rotate n  set bits into the bottom.  But variable-count rotates are 3 or 2 uops on Intel
    __mmask16 msk = _cvtu32_mask16((1U<<len) - 1);   // int is 32 bits so even 1<<16 doesn't overflow
    masked_16byte_store(dst, msk, src);
}

In your case, you don't need a lookup table for length->mask either; just use a bithack. It compiles to a couple of instructions. Possibly there'd be some advantage to a kmov load from a lookup table (see the section on kmov k, mem costs below), if it always hits in cache and the front-end is a bottleneck (which Agner Fog says is common on KNL). _cvtu32_mask16 to convert from unsigned to __mmask16 isn't strictly necessary in practice; in the major implementations __mmask16 is just unsigned short, so implicit conversion works.

This compiles nicely with GCC, clang, and ICC (Godbolt), including for Skylake-avx512 (with AVX512BW):

 # GCC 12 -O3 -march=knl
store_first_n_bytes(void*, int, long long __vector(2)):
        vpmovzxbd       zmm0, xmm0
        mov     eax, -1
        shlx    eax, eax, esi
        not     eax
        kmovw   k1, eax
        vpmovdb XMMWORD PTR [rdi]{k1}, zmm0
        ret
 # ICC2021 -O3 -march=skylake-avx512   compiles the source more literally
store_first_n_bytes(void*, int, __m128i):
        mov       eax, 1                                        #24.21
        shlx      edx, eax, esi                                 #24.21
        dec       edx                                           #24.21
        kmovw     k1, edx                                       #8.5
        vmovdqu8  XMMWORD PTR [rdi]{k1}, xmm0                   #8.5
        ret                                                     #26.1

Or with -march=knl, ICC uses add eax,-1. And of course vpmovzxbd / vpmovdb. It uses ZMM16 as a temporary, perhaps because on non-KNL that would avoid needing a vzeroupper. KNL doesn't ever need vzeroupper, and should avoid it because it's slow.

MSVC -O2 -arch:AVX512 uses basically the same asm as ICC. Using that compiler option defines __AVX512BW__, so I wouldn't recommend using MSVC to build for KNL.

Even better would be xor eax,eax / bts eax, esi (create 1<<len) / dec eax, but compilers often fail to use BTS, which is a single uop with a register destination on Intel CPUs (https://uops.info/). The xor-zeroing is as cheap as a NOP on Sandybridge-family CPUs, and still fewer instruction bytes on others. Smaller machine-code size should help KNL's weak front-end.
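
For reference, that hand-written sequence (not something any of the compilers above actually emit) would look like:

 # hypothetical hand-written mask setup
        xor     eax, eax                # zeroing idiom; eliminated at rename on Sandybridge-family
        bts     eax, esi                # eax = 1 << len  (single uop with a register destination)
        dec     eax                     # eax = (1 << len) - 1
        kmovw   k1, eax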


kmov k, mem costs on Xeon Phi vs. mainstream CPUs

According to Agner Fog's instruction tables, Xeon Phi (Knight's Landing) has kmovw k,m as a single uop with 2/clock throughput. So a table lookup actually can save uops vs. a bithack.

But Skylake-server, Ice Lake, and Alder Lake (proxy for Sapphire Rapids) all run it as 3 uops for the front-end, 2 for the back-end (on ports p23 + p5). Basically like a mov to a temporary general-purpose register plus a kmov k, r. (https://uops.info/)

So a table lookup costs at least 3 front-end uops on mainstream CPUs, including a port 5 uop (like the kmovw k1, edx). And that's assuming the table is in static storage in a non-PIE executable and len is already zero-extended to 64-bit, so you can use a non-RIP-relative addressing mode like kmovw k1, [LUT + rsi*2]. If you need a RIP-relative LEA in there for position-independent code, you break even with the bithack on front-end uops. (Most Linux distros configure GCC with -fPIE -pie on by default, so it would be lea rcx, [rip + LUT] / kmovw k1, [rcx + rsi*2].)
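
If you do want to experiment with the table on KNL, a sketch (the table and function names are mine) reusing the masked_16byte_store helper from above:

// 17 entries: len -> mask with the low `len` bits set, same data as the
// question's len2mask[] table.
static const unsigned short len2mask_lut[17] = {
    0x0000, 0x0001, 0x0003, 0x0007, 0x000F, 0x001F, 0x003F, 0x007F,
    0x00FF, 0x01FF, 0x03FF, 0x07FF, 0x0FFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF
};

void store_first_n_bytes_lut(void *dst, int len, __m128i src)
{
    __mmask16 msk = len2mask_lut[len];   // whether this becomes kmovw k, mem or a GPR load + kmovw k, r depends on the compiler
    masked_16byte_store(dst, msk, src);
}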


Without AVX-512, cacheable masked stores only allow 32-bit or 64-bit element size: vmaskmovps/pd and vpmaskmovd/q, i.e. _mm_maskstore_epi32 and so on. They're rather slow on AMD, at least on Zen 3 and earlier (up to 42 uops, 12-cycle throughput for a 256-bit store with 8x 32-bit elements), since masked stores require special hardware support even when fault suppression isn't required. Loads are easier; the hardware can just do a normal load and zero some elements after the fact.
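
For example, a 32-bit-element version without AVX-512 could look like this (a sketch with made-up names, building the mask by comparing an index vector against n; requires AVX2 for _mm_maskstore_epi32):

#include <immintrin.h>

// Store the first n 32-bit elements of v (0 <= n <= 4); elements with index < n
// get an all-ones mask and are stored, the rest are left untouched.
void store_first_n_dwords(int *dst, int n, __m128i v)
{
    __m128i idx  = _mm_set_epi32(3, 2, 1, 0);
    __m128i mask = _mm_cmpgt_epi32(_mm_set1_epi32(n), idx);   // true where idx < n
    _mm_maskstore_epi32(dst, mask, v);
}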

Peter Cordes

If you actually intend to generate something like:

__m128i mask = _mm_maskz_broadcastb_epi8(len2mask[length],_mm_set1_epi32(~0))

Why not just:

void foo(int length, char* mem_addr, const __m128i a)
{
    // count = {0, 1, 2, ..., 15}: the index of each byte lane
    __m128i count = _mm_set_epi8(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
    // lanes with index < length compare as all-ones, the rest as zero
    __m128i mask = _mm_cmpgt_epi8(_mm_set1_epi8(length), count);
    _mm_maskmoveu_si128(a, mask, mem_addr);
}

Godbolt demonstration.

chtz
  • `_mm_maskmoveu_si128` is usually bad for performance (SSE2 `maskmovdqu` has NT store semantics). On KNL where you don't have AVX512BW, a masked store of 16 bytes can be done with AVX-512F `vpmovdb`, aka `_mm512_mask_cvtepi32_storeu_epi8`, truncating a vector of 32-bit integers down to effectively a `__m128i`. (Widening conversions like `_mm512_cvtepu8_epi32` (`vpmovzxbd`) are also available.) – Peter Cordes Jul 18 '22 at 19:04