There's no sequential-across-lanes pack until AVX-512, unfortunately. (And even then, only for 1 register at a time, or without saturation.)
The in-lane behaviour of shuffles like vpackssdw and vpalignr is one of the major warts of AVX2 that makes the 256-bit versions of those shuffles less useful than their __m128i versions. But on Intel, and on Zen 2 CPUs, it is often still best to use __m256i vectors with a vpermq at the end, if you need the elements in a specific order. (Or vpermd with a vector constant after 2 levels of packing: How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?)
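For example, a minimal sketch of that pack-then-vpermq pattern for signed-saturating dword-to-word packing (the helper name is just for illustration):

#include <immintrin.h>

// Pack two vectors of 8 x int32_t into one vector of 16 x int16_t with
// signed saturation, then fix up the in-lane ordering with vpermq.
__m256i pack32to16(__m256i a, __m256i b) {
    __m256i packed = _mm256_packs_epi32(a, b);   // vpackssdw: packs within each 128-bit lane
    // Lanes hold {a0..3, b0..3 | a4..7, b4..7}; reorder the qwords to get a0..7, b0..7.
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}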
If your 32-bit elements came from unpacking narrower elements, and you don't care about order of the wider elements, you can widen with in-lane unpacks, which sets you up to pack back into the original order.
This is cheap for zero-extending unpacks: _mm256_unpacklo/hi_epi16 (with _mm256_setzero_si256()). That's as cheap as vpmovzxwd (_mm256_cvtepu16_epi32), and is actually better because you can do 256-bit loads of your source data and unpack two ways, instead of narrow loads to feed vpmovzx..., which only works on data at the bottom of an input register. (And a memory-source vpmovzx... ymm, [mem] can't micro-fuse the load with a YMM destination on Intel CPUs, only with the 128-bit XMM destination, so the front-end cost is the same as separate load and shuffle instructions.)
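A minimal sketch of that zero-extending unpack, assuming src points at 16 readable uint16_t values (the helper name is made up). Note the results are in in-lane order (elements 0-3 and 8-11 in lo32, 4-7 and 12-15 in hi32), which a later in-lane pack undoes for free:

#include <immintrin.h>
#include <stdint.h>

// One 256-bit load feeds two zero-extending unpacks against a zeroed vector.
void zext_u16_to_u32(const uint16_t* src, __m256i* lo32, __m256i* hi32) {
    __m256i v    = _mm256_loadu_si256((const __m256i*)src);  // 16 x uint16_t
    __m256i zero = _mm256_setzero_si256();
    *lo32 = _mm256_unpacklo_epi16(v, zero);   // vpunpcklwd: elements 0-3 and 8-11
    *hi32 = _mm256_unpackhi_epi16(v, zero);   // vpunpckhwd: elements 4-7 and 12-15
}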
But that trick doesn't work quite as nicely for data you need to sign-extend. Using vpcmpgtw to generate the high halves for vpunpckl/hwd does work, but vpermq when re-packing is about as good, just different execution-port pressure. So vpmovsxwd is simpler there.
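Still, a sketch of the compare + unpack variant, in case you want to keep the 256-bit-load advantage (assuming v holds 16 x int16_t; the helper name is made up):

#include <immintrin.h>

// Sign-extend via in-lane unpacks: a compare against zero produces the sign bits.
void sext_i16_to_i32(__m256i v, __m256i* lo32, __m256i* hi32) {
    __m256i signs = _mm256_cmpgt_epi16(_mm256_setzero_si256(), v);  // vpcmpgtw: 0xFFFF where v < 0
    *lo32 = _mm256_unpacklo_epi16(v, signs);   // vpunpcklwd: sign-extend low words of each lane
    *hi32 = _mm256_unpackhi_epi16(v, signs);   // vpunpckhwd: sign-extend high words of each lane
}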
Slicing up your data into odd/even instead of low/high halves can also work, e.g. to get 16-bit elements zero-extended into 32-bit elements:
auto veven = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));  // even elements, already zero-extended
auto vodd = _mm256_srli_epi32(v, 16);                             // odd elements shifted down, zero-extended
After processing, one can recombine with a shift and vpblendw (1 uop, for port 5 on Intel Skylake / Ice Lake). Or, for bytes, vpblendvb with a control vector, but that costs 2 uops on Intel CPUs (for any port), vs. only 1 uop on Zen 2. (Those uop counts don't include the vpslld ymm, ymm, 16 shift to line the odd elements back up with their starting positions.)
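A sketch of that recombine step for 16-bit elements, assuming the processed results still fit in the low 16 bits of each dword of veven and vodd (the helper name is made up):

#include <immintrin.h>

// Shift the odd results back up, then blend: even word positions from veven,
// odd word positions from the shifted vodd.
__m256i recombine_odd_even(__m256i veven, __m256i vodd) {
    __m256i odd_hi = _mm256_slli_epi32(vodd, 16);         // vpslld ymm, ymm, 16
    return _mm256_blend_epi16(odd_hi, veven, 0x55);       // vpblendw: 0x55 selects the even word positions from veven
}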
Even with AVX-512, the situation isn't perfect: you still can't use a single shuffle uop to combine 2 vectors into one vector of the same width with saturation.
There's very nice single-vector narrowing with truncation, or signed or unsigned saturation, for any pair of element sizes, like an inverse of vpmovzx / sx, e.g. qword-to-byte vpmov[su]qb, with an optional memory destination.
(Fun fact: vpmovdb [rdi]{k1}, zmm0 was the only way Xeon Phi (lacking both AVX-512BW and AVX-512VL) could do byte-masked stores to memory; that might be why these instructions exist in memory-destination form. On mainstream Intel like Skylake-X / Ice Lake, the memory-destination versions are no cheaper than a separate pack into a register followed by a store. https://uops.info/)
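For reference, a quick sketch of the dword-to-word flavours via intrinsics (AVX-512F), assuming v holds 16 x int32_t; the wrapper names are made up, the _mm512_cvt* intrinsics map to vpmovdw / vpmovsdw / vpmovusdw:

#include <immintrin.h>

// Single-vector narrowing from 16 x int32_t to 16 x int16_t.
__m256i narrow_trunc(__m512i v) { return _mm512_cvtepi32_epi16(v); }    // vpmovdw:   truncate
__m256i narrow_ssat (__m512i v) { return _mm512_cvtsepi32_epi16(v); }   // vpmovsdw:  signed saturation
__m256i narrow_usat (__m512i v) { return _mm512_cvtusepi32_epi16(v); }  // vpmovusdw: unsigned saturation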
AVX-512 also has nice 2-input shuffles with a control vector, so for dword-to-word truncation you could use vpermt2w zmm1, zmm2, zmm3. But that needs a shuffle control vector, and vpermt2w is 3 uops on SKX and Ice Lake. (vpermt2d and vpermt2q are 1 uop.) vpermt2b is only available with AVX-512VBMI (Ice Lake), and is also 3 uops there.
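A sketch of that 2-input truncating pack (AVX-512BW): index 2*i picks the low word of every dword across the concatenation of the two sources, which gives sequential order. The helper names are made up:

#include <immintrin.h>
#include <stdint.h>

// Control vector: word indices 0, 2, 4, ..., 62 across the concatenation {a, b}.
__m512i make_pack_control() {
    alignas(64) uint16_t idx[32];
    for (int i = 0; i < 32; i++) idx[i] = (uint16_t)(2 * i);
    return _mm512_load_si512(idx);
}

// Truncate two vectors of 16 x int32_t to one vector of 32 x int16_t, in order.
__m512i pack_dw_truncate(__m512i a, __m512i b, __m512i ctrl) {
    return _mm512_permutex2var_epi16(a, ctrl, b);   // vpermt2w: 3 uops on SKX / Ice Lake
}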
(Unlike vpermb, which is 1 uop on Ice Lake, and AVX-512BW vpermw, which is still 2 uops on Ice Lake. So they didn't reduce the front-end cost of the backwards-compatible instruction, but ICL can run 1 of its 2 uops on port 0 or 1, instead of both on the shuffle unit on port 5. Perhaps ICL has one uop preprocess the shuffle control into a vpermb control or something, which would also explain the improved latency: 3 cycles for data->data and 4 cycles for control->data, vs. 6 cycles on SKX for the 2p5 uops, apparently a serial dependency starting with both the control and data vectors.)