How do I store the low 8 bits of every 32-bit lane? For example, a ZMM register holds 16 32-bit integers, and I only need to store the low 8 bits of each to memory, into an int8_t array.
1 Answer
There's an instruction for that, `vpmovdb` (https://www.felixcloutier.com/x86/vpmovdb:vpmovsdb:vpmovusdb). But unfortunately Intel CPUs run it as 2 uops, both for port 5 (or 3 total with a memory destination; see https://uops.info/ and https://agner.org/optimize/). The C intrinsics:
```
VPMOVDB __m128i _mm512_cvtepi32_epi8(__m512i a);
VPMOVDB __m128i _mm512_mask_cvtepi32_epi8(__m128i s, __mmask16 k, __m512i a);
VPMOVDB __m128i _mm512_maskz_cvtepi32_epi8(__mmask16 k, __m512i a);
VPMOVDB void    _mm512_mask_cvtepi32_storeu_epi8(void *d, __mmask16 k, __m512i a);
```
Fun fact: it allows a memory destination and requires only AVX512F, not AVX512BW, so it's how KNL Xeon Phi could do byte-granularity masked stores. There are signed and unsigned saturating forms, but you want the truncating form.
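For a single vector, here's a minimal sketch of those intrinsics in use (the wrapper function names are placeholders, not part of any library):

```c
#include <immintrin.h>
#include <stdint.h>

// Truncate each of the 16 int32_t lanes to its low byte and store the
// resulting 16 bytes. Plain truncation, not saturation.
static inline void store_low_bytes(int8_t *dst, __m512i v)
{
    _mm_storeu_si128((__m128i *)dst, _mm512_cvtepi32_epi8(v));
}

// The memory-destination form: an all-ones mask stores all 16 bytes;
// other masks give byte-granularity masked stores (AVX512F only).
static inline void store_low_bytes_masked(int8_t *dst, __mmask16 k, __m512i v)
{
    _mm512_mask_cvtepi32_storeu_epi8(dst, k, v);
}
```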
Truncation and lane-crossing packing are both new in AVX-512; with AVX2 you'd typically have to use 2x `vpackssdw` to feed `vpackuswb` and a `vpermq` fixup, also masking the 32-bit data into the 0..255 unsigned range before packing, which costs an extra `vpand` per input vector (a sketch follows the links below).
- How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?
- How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)
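As a concrete illustration of that AVX2 sequence, here's a minimal sketch packing four `__m256i` of dwords down to one vector of bytes. The function name is hypothetical, and it uses `vpermd` (`_mm256_permutevar8x32_epi32`) for the final fixup, since the in-lane packs leave 4-byte groups that need a dword-granularity permute:

```c
#include <immintrin.h>

// Hypothetical: pack 4 vectors of 8x int32 into one vector of 32x uint8.
static inline __m256i pack_4x_epi32_to_epi8(__m256i a, __m256i b,
                                            __m256i c, __m256i d)
{
    const __m256i mask = _mm256_set1_epi32(0xFF);
    // Mask to 0..255 so the saturating packs can't clamp anything.
    a = _mm256_and_si256(a, mask);
    b = _mm256_and_si256(b, mask);
    c = _mm256_and_si256(c, mask);
    d = _mm256_and_si256(d, mask);

    __m256i ab   = _mm256_packs_epi32(a, b);    // vpackssdw: dwords -> words, in-lane
    __m256i cd   = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packus_epi16(ab, cd); // vpackuswb: words -> bytes, in-lane

    // In-lane packing leaves dwords in order
    // [a0-3 b0-3 c0-3 d0-3 | a4-7 b4-7 c4-7 d4-7]; sort them back.
    const __m256i fixup = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    return _mm256_permutevar8x32_epi32(abcd, fixup);
}
```

The four `vpand` instructions are the per-vector extra cost mentioned above; skip them if your data is already known to be in the 0..255 range.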
With multiple vectors
If you have multiple vectors of data, it could be worthwhile to use AVX-512VBMI (Ice Lake / Zen 4) `vpermt2b` to grab bytes from two vectors. It runs as 3 uops (2p5 + p015) on Intel CPUs that support it, so it's 2 cycles per 256-bit vector instead of 2 cycles per 128-bit vector. It's 1 uop on Zen 4, same as `vpmovdb`, so `vpermt2b` is ideal there, allowing one 256-bit store per clock: twice as fast as Intel, and with fewer front-end uops.
The `vpack...` in-lane pack instructions are single-uop on Intel and do 2:1 packing of two registers into one of the same width. With a `vpermt2d` shuffle at the end, this can come out ahead for multiple vectors.
So given 4 ZMM registers of input, 4x `vpmovdb` would take 8 cycles of throughput on Intel. 4x `vpand` + 2x `vpackssdw` + 1x `vpackuswb` + a `vpermq` fixup generates one ZMM ready to store per 4 clock cycles on Skylake/Cascade Lake and later.
(`vpermt2w` is available on Skylake-AVX512 so it would be usable instead of `vpackuswb` + `vpermq`, but unfortunately it's 3 uops there and on later Intel.)
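Here's a sketch of that multi-vector strategy, assuming AVX-512BW; the function name is a placeholder, and it uses a dword-granularity `vpermd` (`_mm512_permutexvar_epi32`) for the final lane fixup of the 4-byte groups:

```c
#include <immintrin.h>

// Hypothetical: pack 4 ZMMs of 16x int32 into one ZMM of 64x uint8.
static inline __m512i pack_4x_zmm_epi32_to_epi8(__m512i a, __m512i b,
                                                __m512i c, __m512i d)
{
    const __m512i mask = _mm512_set1_epi32(0xFF);
    a = _mm512_and_si512(a, mask);              // vpand: force 0..255 so
    b = _mm512_and_si512(b, mask);              // saturation can't clamp
    c = _mm512_and_si512(c, mask);
    d = _mm512_and_si512(d, mask);

    __m512i ab   = _mm512_packs_epi32(a, b);    // vpackssdw (in-lane)
    __m512i cd   = _mm512_packs_epi32(c, d);
    __m512i abcd = _mm512_packus_epi16(ab, cd); // vpackuswb (in-lane)

    // Each 128-bit lane now holds one 4-byte group from each input in
    // [a b c d] order; a dword permute sorts the 16 groups.
    const __m512i fixup = _mm512_setr_epi32(0, 4,  8, 12, 1, 5,  9, 13,
                                            2, 6, 10, 14, 3, 7, 11, 15);
    return _mm512_permutexvar_epi32(fixup, abcd);
}
```

The three packs and the permute are the four port-5 shuffle uops behind the 4-cycles-per-ZMM estimate above; the `vpand`s can run on other ports.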
Another idea with AVX2 instructions, from How to convert uint32 to uint8 using simd but not avx512?: use `vpshufb` (`_mm256_shuffle_epi8`) to pack and zero, setting up for a 2-input shuffle with wider granularity. But that requires one port-5 uop per input vector.
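A minimal single-vector sketch of that pack-and-zero step (hypothetical function name); for multiple vectors you'd instead shuffle each vector's bytes into different dword positions and merge with a wider-granularity 2-input shuffle:

```c
#include <immintrin.h>
#include <stdint.h>

// 8x int32 -> 8x uint8 for one __m256i, using vpshufb + vpermd.
static inline void store_low_bytes_avx2(uint8_t *dst, __m256i v)
{
    // In each 128-bit lane, move the low byte of each dword to the
    // bottom 4 bytes; indices with the high bit set (-1) write zero.
    const __m256i shuf = _mm256_setr_epi8(
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i packed = _mm256_shuffle_epi8(v, shuf);      // vpshufb (in-lane)
    // Lane-crossing dword shuffle to join the two lanes' bytes.
    __m256i gathered = _mm256_permutevar8x32_epi32(
        packed, _mm256_setr_epi32(0, 4, 1, 2, 3, 5, 6, 7));
    _mm_storel_epi64((__m128i *)dst,
                     _mm256_castsi256_si128(gathered)); // store 8 bytes
}
```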
Intel's C/C++ intrinsics guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) has a search which works on asm mnemonics; they're shorter to type and talk about.
