How do I store the low 8 bits of every 32-bit lane? For example, a ZMM register holds 16 32-bit integers, and I only need to store the low 8 bits of each to memory, into an int8_t array.
1 Answer
There's an instruction for that, `vpmovdb` (https://www.felixcloutier.com/x86/vpmovdb:vpmovsdb:vpmovusdb). But unfortunately Intel CPUs run it as 2 uops, both for port 5 (or 3 total with a memory destination; see https://uops.info/ and https://agner.org/optimize/). The C intrinsics:
```
VPMOVDB __m128i _mm512_cvtepi32_epi8(__m512i a);
VPMOVDB __m128i _mm512_mask_cvtepi32_epi8(__m128i s, __mmask16 k, __m512i a);
VPMOVDB __m128i _mm512_maskz_cvtepi32_epi8(__mmask16 k, __m512i a);
VPMOVDB void    _mm512_mask_cvtepi32_storeu_epi8(void *d, __mmask16 k, __m512i a);
```
Fun fact: it allows a memory destination and requires only AVX512F, not AVX512BW, so it's how KNL Xeon Phi could do byte-granularity masked stores. There are signed and unsigned saturating forms, but you want the truncating form.
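For a single vector, here's a minimal sketch of those intrinsics in use (the wrapper function names are placeholders, not part of any library):

```c
#include <immintrin.h>
#include <stdint.h>

// Truncate each of the 16 int32_t lanes to its low byte and store the
// resulting 16 bytes. Plain truncation, not saturation.
static inline void store_low_bytes(int8_t *dst, __m512i v)
{
    _mm_storeu_si128((__m128i *)dst, _mm512_cvtepi32_epi8(v));
}

// The memory-destination form: an all-ones mask stores all 16 bytes;
// other masks give byte-granularity masked stores (AVX512F only).
static inline void store_low_bytes_masked(int8_t *dst, __mmask16 k, __m512i v)
{
    _mm512_mask_cvtepi32_storeu_epi8(dst, k, v);
}
```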
Truncation and lane-crossing packing are both new in AVX-512; with AVX2 you'd typically have to use 2x `vpackssdw` to feed `vpackuswb` and a `vpermq` fixup, also masking the 32-bit data into the 0..255 unsigned range before packing, which costs an extra `vpand` per input vector (a sketch follows the links below).
- How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?
- How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)
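As a concrete illustration of that AVX2 sequence, here's a minimal sketch packing four `__m256i` of dwords down to one vector of bytes. The function name is hypothetical, and it uses `vpermd` (`_mm256_permutevar8x32_epi32`) for the final fixup, since the in-lane packs leave 4-byte groups that need a dword-granularity permute:

```c
#include <immintrin.h>

// Hypothetical: pack 4 vectors of 8x int32 into one vector of 32x uint8.
static inline __m256i pack_4x_epi32_to_epi8(__m256i a, __m256i b,
                                            __m256i c, __m256i d)
{
    const __m256i mask = _mm256_set1_epi32(0xFF);
    // Mask to 0..255 so the saturating packs can't clamp anything.
    a = _mm256_and_si256(a, mask);
    b = _mm256_and_si256(b, mask);
    c = _mm256_and_si256(c, mask);
    d = _mm256_and_si256(d, mask);

    __m256i ab   = _mm256_packs_epi32(a, b);    // vpackssdw: dwords -> words, in-lane
    __m256i cd   = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packus_epi16(ab, cd); // vpackuswb: words -> bytes, in-lane

    // In-lane packing leaves dwords in order
    // [a0-3 b0-3 c0-3 d0-3 | a4-7 b4-7 c4-7 d4-7]; sort them back.
    const __m256i fixup = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    return _mm256_permutevar8x32_epi32(abcd, fixup);
}
```

The four `vpand` instructions are the per-vector extra cost mentioned above; skip them if your data is already known to be in the 0..255 range.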
With multiple vectors
If you have multiple vectors of data, it could be worthwhile to use AVX-512VBMI (Ice Lake / Zen 4) `vpermt2b` to grab bytes from two vectors. It runs as 3 uops (2p5 + p015) on Intel CPUs that support it, so it's 2 cycles per 256-bit vector instead of 2 cycles per 128-bit vector. It's 1 uop on Zen 4, same as `vpmovdb`, so `vpermt2b` is ideal there, allowing one 256-bit store per clock: twice as fast as Intel, and with fewer front-end uops.
The `vpack...` in-lane pack instructions are single-uop on Intel and do 2:1 packing of two registers into one of the same width. With a `vpermt2d` shuffle at the end, this can come out ahead for multiple vectors.
So given 4 ZMM registers of input, 4x `vpmovdb` would take 8 cycles of throughput on Intel. 4x `vpand` + 2x `vpackssdw` + 1x `vpackuswb` + a `vpermq` fixup generates one ZMM ready to store per 4 clock cycles on Skylake/Cascade Lake and later.
(`vpermt2w` is available on Skylake-AVX512 so it would be usable instead of `vpackuswb` + `vpermq`, but unfortunately it's 3 uops there and on later Intel.)
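Here's a sketch of that multi-vector strategy, assuming AVX-512BW; the function name is a placeholder, and it uses a dword-granularity `vpermd` (`_mm512_permutexvar_epi32`) for the final lane fixup of the 4-byte groups:

```c
#include <immintrin.h>

// Hypothetical: pack 4 ZMMs of 16x int32 into one ZMM of 64x uint8.
static inline __m512i pack_4x_zmm_epi32_to_epi8(__m512i a, __m512i b,
                                                __m512i c, __m512i d)
{
    const __m512i mask = _mm512_set1_epi32(0xFF);
    a = _mm512_and_si512(a, mask);              // vpand: force 0..255 so
    b = _mm512_and_si512(b, mask);              // saturation can't clamp
    c = _mm512_and_si512(c, mask);
    d = _mm512_and_si512(d, mask);

    __m512i ab   = _mm512_packs_epi32(a, b);    // vpackssdw (in-lane)
    __m512i cd   = _mm512_packs_epi32(c, d);
    __m512i abcd = _mm512_packus_epi16(ab, cd); // vpackuswb (in-lane)

    // Each 128-bit lane now holds one 4-byte group from each input in
    // [a b c d] order; a dword permute sorts the 16 groups.
    const __m512i fixup = _mm512_setr_epi32(0, 4,  8, 12, 1, 5,  9, 13,
                                            2, 6, 10, 14, 3, 7, 11, 15);
    return _mm512_permutexvar_epi32(fixup, abcd);
}
```

The three packs and the permute are the four port-5 shuffle uops behind the 4-cycles-per-ZMM estimate above; the `vpand`s can run on other ports.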
Another idea with AVX2 instructions, from How to convert uint32 to uint8 using simd but not avx512?: use `vpshufb` (`_mm256_shuffle_epi8`) to pack and zero, setting up for a 2-input shuffle with wider granularity. But that requires one port-5 uop per input vector.
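A minimal single-vector sketch of that pack-and-zero step (hypothetical function name); for multiple vectors you'd instead shuffle each vector's bytes into different dword positions and merge with a wider-granularity 2-input shuffle:

```c
#include <immintrin.h>
#include <stdint.h>

// 8x int32 -> 8x uint8 for one __m256i, using vpshufb + vpermd.
static inline void store_low_bytes_avx2(uint8_t *dst, __m256i v)
{
    // In each 128-bit lane, move the low byte of each dword to the
    // bottom 4 bytes; indices with the high bit set (-1) write zero.
    const __m256i shuf = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i packed = _mm256_shuffle_epi8(v, shuf);      // vpshufb (in-lane)
    // Lane-crossing dword shuffle to join the two lanes' bytes.
    __m256i gathered = _mm256_permutevar8x32_epi32(
        packed, _mm256_setr_epi32(0, 4, 1, 2, 3, 5, 6, 7));
    _mm_storel_epi64((__m128i *)dst,
                     _mm256_castsi256_si128(gathered)); // store 8 bytes
}
```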
Intel's C/C++ intrinsics guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) has a search which works on asm mnemonics; they're shorter to type and talk about.
