There's no sequential-across-lanes pack until AVX-512, unfortunately. (And even then, only for 1 register at a time, or without saturation.)
The in-lane behaviour of shuffles like vpackssdw and vpalignr is one of the major warts of AVX2 that makes the 256-bit versions of those shuffles less useful than their __m128i versions. But on Intel, and on Zen 2 CPUs, it is often still best to use __m256i vectors with a vpermq at the end, if you need the elements in a specific order. (Or vpermd with a vector constant after 2 levels of packing: How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?)
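For example, a minimal sketch of that pack-then-vpermq pattern for signed-saturating dword-to-word packing (the helper name is just for illustration):

#include <immintrin.h>

// Pack two vectors of 8 x int32_t into one vector of 16 x int16_t with
// signed saturation, then fix up the in-lane ordering with vpermq.
__m256i pack32to16(__m256i a, __m256i b) {
    __m256i packed = _mm256_packs_epi32(a, b);   // vpackssdw: packs within each 128-bit lane
    // Lanes hold {a0..3, b0..3 | a4..7, b4..7}; reorder the qwords to get a0..7, b0..7.
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}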
If your 32-bit elements came from unpacking narrower elements, and you don't care about order of the wider elements, you can widen with in-lane unpacks, which sets you up to pack back into the original order.
This is cheap for zero-extending unpacks: _mm256_unpacklo/hi_epi16 (with _mm256_setzero_si256()). That's as cheap as vpmovzxwd (_mm256_cvtepu16_epi32), and is actually better because you can do 256-bit loads of your source data and unpack two ways, instead of narrow loads to feed vpmovzx..., which only works on data at the bottom of an input register. (And a memory-source vpmovzx... ymm, [mem] can't micro-fuse the load with a YMM destination on Intel CPUs, only with the 128-bit XMM destination, so the front-end cost is the same as separate load and shuffle instructions.)
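A minimal sketch of that zero-extending unpack, assuming src points at 16 readable uint16_t values (the helper name is made up). Note the results are in in-lane order (elements 0-3 and 8-11 in lo32, 4-7 and 12-15 in hi32), which a later in-lane pack undoes for free:

#include <immintrin.h>
#include <stdint.h>

// One 256-bit load feeds two zero-extending unpacks against a zeroed vector.
void zext_u16_to_u32(const uint16_t* src, __m256i* lo32, __m256i* hi32) {
    __m256i v    = _mm256_loadu_si256((const __m256i*)src);  // 16 x uint16_t
    __m256i zero = _mm256_setzero_si256();
    *lo32 = _mm256_unpacklo_epi16(v, zero);   // vpunpcklwd: elements 0-3 and 8-11
    *hi32 = _mm256_unpackhi_epi16(v, zero);   // vpunpckhwd: elements 4-7 and 12-15
}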
But that trick doesn't work quite as nicely for data you need to sign-extend. Using vpcmpgtw to generate the high halves for vpunpckl/hwd does work, but vpermq when re-packing is about as good, just different execution-port pressure. So vpmovsxwd is simpler there.
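Still, a sketch of the compare + unpack variant, in case you want to keep the 256-bit-load advantage (assuming v holds 16 x int16_t; the helper name is made up):

#include <immintrin.h>

// Sign-extend via in-lane unpacks: a compare against zero produces the sign bits.
void sext_i16_to_i32(__m256i v, __m256i* lo32, __m256i* hi32) {
    __m256i signs = _mm256_cmpgt_epi16(_mm256_setzero_si256(), v);  // vpcmpgtw: 0xFFFF where v < 0
    *lo32 = _mm256_unpacklo_epi16(v, signs);   // vpunpcklwd: sign-extend low words of each lane
    *hi32 = _mm256_unpackhi_epi16(v, signs);   // vpunpckhwd: sign-extend high words of each lane
}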
Slicing up your data into odd/even instead of low/high halves can also work, e.g. to get 16-bit elements zero-extended into 32-bit elements:
auto veven = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));  // even elements, already zero-extended
auto vodd = _mm256_srli_epi32(v, 16);                             // odd elements shifted down, zero-extended
After processing, one can recombine with a shift and vpblendw (1 uop, for port 5 on Intel Skylake / Ice Lake). Or, for bytes, vpblendvb with a control vector, but that costs 2 uops on Intel CPUs (for any port), vs. only 1 uop on Zen 2. (Those uop counts don't include the vpslld ymm, ymm, 16 shift to line the odd elements back up with their starting positions.)
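A sketch of that recombine step for 16-bit elements, assuming the processed results still fit in the low 16 bits of each dword of veven and vodd (the helper name is made up):

#include <immintrin.h>

// Shift the odd results back up, then blend: even word positions from veven,
// odd word positions from the shifted vodd.
__m256i recombine_odd_even(__m256i veven, __m256i vodd) {
    __m256i odd_hi = _mm256_slli_epi32(vodd, 16);         // vpslld ymm, ymm, 16
    return _mm256_blend_epi16(odd_hi, veven, 0x55);       // vpblendw: 0x55 selects the even word positions from veven
}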
Even with AVX-512, the situation isn't perfect: you still can't use a single shuffle uop to combine 2 vectors into one vector of the same width with saturation.
There's very nice single-vector narrowing with truncation, or signed or unsigned saturation, for any pair of element sizes, like an inverse of vpmovzx / sx, e.g. qword-to-byte vpmov[su]qb, with an optional memory destination.
(Fun fact: vpmovdb [rdi]{k1}, zmm0 was the only way Xeon Phi (lacking both AVX-512BW and AVX-512VL) could do byte-masked stores to memory; that might be why these instructions exist in memory-destination form. On mainstream Intel like Skylake-X / Ice Lake, the memory-destination versions are no cheaper than a separate pack into a register followed by a store. https://uops.info/)
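For reference, a quick sketch of the dword-to-word flavours via intrinsics (AVX-512F), assuming v holds 16 x int32_t; the wrapper names are made up, the _mm512_cvt* intrinsics map to vpmovdw / vpmovsdw / vpmovusdw:

#include <immintrin.h>

// Single-vector narrowing from 16 x int32_t to 16 x int16_t.
__m256i narrow_trunc(__m512i v) { return _mm512_cvtepi32_epi16(v); }    // vpmovdw:   truncate
__m256i narrow_ssat (__m512i v) { return _mm512_cvtsepi32_epi16(v); }   // vpmovsdw:  signed saturation
__m256i narrow_usat (__m512i v) { return _mm512_cvtusepi32_epi16(v); }  // vpmovusdw: unsigned saturation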
AVX-512 also has nice 2-input shuffles with a control vector, so for dword-to-word truncation you could use vpermt2w zmm1, zmm2, zmm3. But that needs a shuffle control vector, and vpermt2w is 3 uops on SKX and Ice Lake. (vpermt2d and vpermt2q are 1 uop.) vpermt2b is only available with AVX-512VBMI (Ice Lake), and is also 3 uops there.
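A sketch of that 2-input truncating pack (AVX-512BW): index 2*i picks the low word of every dword across the concatenation of the two sources, which gives sequential order. The helper names are made up:

#include <immintrin.h>
#include <stdint.h>

// Control vector: word indices 0, 2, 4, ..., 62 across the concatenation {a, b}.
__m512i make_pack_control() {
    alignas(64) uint16_t idx[32];
    for (int i = 0; i < 32; i++) idx[i] = (uint16_t)(2 * i);
    return _mm512_load_si512(idx);
}

// Truncate two vectors of 16 x int32_t to one vector of 32 x int16_t, in order.
__m512i pack_dw_truncate(__m512i a, __m512i b, __m512i ctrl) {
    return _mm512_permutex2var_epi16(a, ctrl, b);   // vpermt2w: 3 uops on SKX / Ice Lake
}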
(Unlike vpermb, which is 1 uop on Ice Lake, and AVX-512BW vpermw, which is still 2 uops on Ice Lake. So they didn't reduce the front-end cost of the backwards-compatible instruction, but ICL can run 1 of its 2 uops on port 0 or 1, instead of both on the shuffle unit on port 5. Perhaps ICL has one uop preprocess the shuffle control into a vpermb control or something, which would also explain the improved latency: 3 cycles for data->data and 4 cycles for control->data, vs. 6 cycles on SKX for the 2p5 uops, apparently a serial dependency starting with both the control and data vectors.)