
Intel's vector extensions (SSE, AVX, etc.) provide two unpack operations for each element size; e.g. the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

inputs:      (A0 A1 A2 A3) (B0 B1 B2 B3)
unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)
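
For example, with 32-bit integer elements the pair of SSE intrinsics is _mm_unpacklo_epi32 / _mm_unpackhi_epi32 (a minimal sketch; the wrapper names are just for illustration):

#include <immintrin.h>

// a = (A0 A1 A2 A3), b = (B0 B1 B2 B3), element 0 listed first as above
static inline __m128i zip_lo32(__m128i a, __m128i b) { return _mm_unpacklo_epi32(a, b); } // (A0 B0 A1 B1)
static inline __m128i zip_hi32(__m128i a, __m128i b) { return _mm_unpackhi_epi32(a, b); } // (A2 B2 A3 B3)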

The equivalent of unpack in ARM's NEON instruction set is vzip. However, NEON also provides vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this:

inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
vuzp:   (A0 A2 B0 B2) (A1 A3 B1 B3)

How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done using a shuffle and a subsequent unpack that moves 2 elements at a time:

inputs:        (A0 A1 A2 A3) (B0 B1 B2 B3)
shuffle:       (A0 A2 A1 A3) (B0 B2 B1 B3)
unpacklo/hi 2: (A0 A2 B0 B2) (A1 A3 B1 B3)
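
In SSE intrinsics for 32-bit elements, that two-step approach would look roughly like this (a sketch; the function name is mine):

#include <immintrin.h>

// unzip of 32-bit elements via one shuffle per input + 64-bit unpacks (sketch)
static inline void unzip32_shuf_unpack(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i as = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (A0 A2 A1 A3)
    __m128i bs = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (B0 B2 B1 B3)
    *even = _mm_unpacklo_epi64(as, bs);                      // (A0 A2 B0 B2)
    *odd  = _mm_unpackhi_epi64(as, bs);                      // (A1 A3 B1 B3)
}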

Is there a more efficient solution using a single instruction? (Maybe for SSE first - I'm aware that for AVX we may have the additional problem that shuffle and unpack don't cross lanes.)

Knowing this may be useful for writing code for data swizzling and deswizzling (it should be possible to derive deswizzling code just by inverting the operations of swizzling code based on unpack operations).

Edit: Here is the 8-element version. This is the effect of NEON's vuzp:

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
vuzp:          (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)

This is my version with one shuffle and one unpack for each output vector (it seems to generalize to larger element counts):

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
shuffle:       (A0 A2 A4 A6 A1 A3 A5 A7) (B0 B2 B4 B6 B1 B3 B5 B7)
unpacklo/hi 4: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
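
For 16-bit elements the in-register shuffle needs pshufb (_mm_shuffle_epi8, SSSE3) with a byte-granularity control vector, since there is no immediate word shuffle across the whole register (this is the shuffle-mask constant discussed in the comments below); a sketch:

#include <immintrin.h>

// unzip of 16-bit elements via pshufb + 64-bit unpack (sketch, SSSE3)
static inline void unzip16_shuf_unpack(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    // (A0 A1 A2 A3 A4 A5 A6 A7) -> (A0 A2 A4 A6 A1 A3 A5 A7)
    const __m128i ctrl = _mm_set_epi8(15,14, 11,10, 7,6, 3,2,   // high half: words 7,5,3,1
                                      13,12,  9,8, 5,4, 1,0);   // low half:  words 6,4,2,0
    __m128i as = _mm_shuffle_epi8(a, ctrl);
    __m128i bs = _mm_shuffle_epi8(b, ctrl);
    *even = _mm_unpacklo_epi64(as, bs);   // (A0 A2 A4 A6 B0 B2 B4 B6)
    *odd  = _mm_unpackhi_epi64(as, bs);   // (A1 A3 A5 A7 B1 B3 B5 B7)
}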

The method suggested by EOF is correct but would require log2(8) = 3 unpack steps (each a lo/hi pair) to reach each output:

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
unpacklo/hi 1: (A0 B0 A1 B1 A2 B2 A3 B3) (A4 B4 A5 B5 A6 B6 A7 B7)
unpacklo/hi 1: (A0 A4 B0 B4 A1 A5 B1 B5) (A2 A6 B2 B6 A3 A7 B3 B7)
unpacklo/hi 1: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
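
In intrinsics, this is just the word-granularity unpack repeated three times (a sketch for 16-bit elements):

#include <immintrin.h>

// EOF's method for 16-bit elements: unpacklo/hi applied log2(8) = 3 times (sketch)
static inline void unzip16_repeat_unpack(__m128i *a, __m128i *b) {
    for (int i = 0; i < 3; i++) {
        __m128i lo = _mm_unpacklo_epi16(*a, *b);
        __m128i hi = _mm_unpackhi_epi16(*a, *b);
        *a = lo;   // after the 3rd round: (A0 A2 A4 A6 B0 B2 B4 B6)
        *b = hi;   // after the 3rd round: (A1 A3 A5 A7 B1 B3 B5 B7)
    }
}
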
Ralf
  • It's nice that AVX512 fixes all of this. Finally! – Mysticial Jul 28 '17 at 15:38
  • @Mysticial: In which respect does AVX512 fix it? Is there an inverse unpack, or did they give up the lane-oriented processing? About the latter: According to the Intel Intrinsics Guide, e.g. unpacks are still lane oriented in AVX512 (e.g. `_mm512_unpackhi_epi8`: "Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst."). – Ralf Jul 28 '17 at 19:15
  • @Ralf Just `unpack[lo/hi]` again `log2(vectorlength)` times. zip/unzip is circular. – EOF Jul 28 '17 at 19:18
  • @Ralf `vpermi2ps/vpermt2ps`, `vpermi2d/vpermt2d` – Mysticial Jul 28 '17 at 19:54
  • @Mysticial: `shufps` does the trick for the 128b case :) – Peter Cordes Jul 29 '17 at 05:30
  • @Mysticial: Thanks, these instructions are really useful. Just as convenience for others: the corresponding intrinsics are `_mm512_permutex2var_*` (available for all integer types, float, and double), also available as `_mm512_mask_permutex2var_*`. – Ralf Jul 29 '17 at 10:00
  • @EOF: I'm not sure whether this fits with my intended application of deinterleaving / interleaving. A student of my SIMD course found this interesting answer: https://stackoverflow.com/a/15377386/3852630 Please have a look at the slides 17-19. The clever idea seems to be to load twice the number of vectors (6 instead of 3 for deinterleaving of 3-element inputs) and then use repeated `unpack` operations to deinterleave them. The structure is very regular and probably generalizes to arbitrary element size. I tried applying the same scheme to the output, but it doesn't seem to be circular. – Ralf Jul 29 '17 at 10:15
  • @EOF: Now I understand your comment: In my example, you could also apply a 1-element `unpack` twice (`log2(4)` times) to arrive at the output of `vuzp`. However, for larger number of elements this would be inefficient. – Ralf Jul 29 '17 at 10:43
  • @Ralf - FWIW your `input`/`output` notation is a bit unclear to me. Most of those instructions take two input vectors and produce one output vector (pre-VEX x86 stuff overlaps the output with one of the inputs, so that doesn't really change things here). So shouldn't the "output" line have half the number of vectors as the input? Or is it showing the result of two different instructions (e.g., the "hi" and "lo" cases side-by-side)? What about the `vuzp` case (maybe `vuzp` has two outputs)? – BeeOnRope Jul 30 '17 at 19:15
  • @BeeOnRope - NEON's `vuzp` actually uses two registers as input and the same two registers as output. Intel instructions/intrinsics only have a single vector output, so, as you say, each line is produced by two instructions (e.g. `unpacklo` and `unpackhi`). So the minimum is 2 instructions (e.g. 2 times `shuffle_ps` as in the answer by Peter Cordes), my combination (`shuffle` plus `unpack`) uses 4. – Ralf Jul 31 '17 at 17:22
  • Thanks @Ralf - that's interesting. I wonder if internally `vuzp` is split into multiple internal operations, since most modern CPU designs want exactly one destination register per micro-operation (renaming is often a hard limit). – BeeOnRope Jul 31 '17 at 17:30
  • My solution is identical to the one suggested here (shuffle even and odd values): https://stackoverflow.com/q/20504618/3852630 – Ralf Dec 10 '17 at 10:50

1 Answer


it should be possible to derive deswizzling code just by inverting the operations

Get used to being disappointed and frustrated by the non-orthogonality of Intel's vector shuffles. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size (so one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors). Also, pack instructions are only available for the 32->16 (dword to word) and 16->8 (word to byte) element sizes. There is no packusqd (64->32).

PACK instructions are only available with saturation, not truncation (until AVX512's vpmovqd), so where they do exist we'd need to prepare 4 different input vectors for 2 PACK instructions. This turns out to be horrible, much worse than your shuffle + unpack solution (see the pack-based version further down).


There is a 2-input shuffle that will do what you want for 32-bit elements, though: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any 2 elements of the second vector. The shuffle we want fits those constraints, so we can use it.

We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):

inputs: a=(A0 A1 A2 A3) b=(B0 B1 B2 B3)
_mm_shuffle_ps(a,b,_MM_SHUFFLE(2,0,2,0)); // (A0 A2 B0 B2)
_mm_shuffle_ps(a,b,_MM_SHUFFLE(3,1,3,1)); // (A1 A3 B1 B3)

_MM_SHUFFLE() uses most-significant-element-first notation, like all of Intel's documentation. Your notation is the opposite.

The only intrinsic for shufps uses __m128 / __m256 vectors (float not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.

#include <immintrin.h>
static inline
__m128i unziplo(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(lo);
}

static inline    
__m128i unziphi(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
    return _mm_castps_si128(hi);
}

gcc will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as non-inline functions. I put them on the Godbolt compiler explorer.

unziplo(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 136
    ret
unziphi(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 221
    ret

Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). It has extra latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit-patterns that happen to represent a NaN; only actual FP math instructions care about that.

Fun fact: on AMD Bulldozer-family CPUs (and Intel Core2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!


Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any instructions with two output registers; they're rare in x86 generally (they exist, e.g. mul r64, but always decode to multiple uops on current CPUs).

It's always going to take at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.

For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and have the same latency. Or even 2 shuffles and 2 blends could be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.


Without SHUFPS:

Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. Those extra shuffles are likely to cost more than any bypass latency you'd avoid, except on Nehalem in cases where latency matters but throughput doesn't.

Any other option would seem to require preparing 4 input vectors, for either a blend or a pack. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, that would take a total of 4 shuffles to make the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
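
For 32-bit elements, that blend variant might look something like this (a sketch, assuming SSE4.1 for pblendw; the shuffle constants and function name are mine):

#include <immintrin.h>

// unzip of 32-bit elements with 4 shuffles + 2 blends (sketch, SSE4.1)
static inline void unzip32_shuf_blend(__m128i a, __m128i b, __m128i *even, __m128i *odd) {
    __m128i a_lo = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (A0 A2 A1 A3)
    __m128i b_lo = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,0,3,1)); // (B1 B3 B0 B2)
    __m128i a_hi = _mm_shuffle_epi32(a, _MM_SHUFFLE(2,0,3,1)); // (A1 A3 A0 A2)
    __m128i b_hi = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (B0 B2 B1 B3)
    // pblendw selects 16-bit lanes: 0xF0 takes the upper two dwords from the second operand
    *even = _mm_blend_epi16(a_lo, b_lo, 0xF0);                 // (A0 A2 B0 B2)
    *odd  = _mm_blend_epi16(a_hi, b_hi, 0xF0);                 // (A1 A3 B1 B3)
}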

Pack instructions also work for 16-bit or 8-bit elements (packusdw for 32->16, packuswb for 16->8); there is no 64->32 pack, so this doesn't cover the 32-bit case. It takes 2x pand to mask off the odd elements of a and b (zero-extending the even elements within each wider container), and 2x psrld (or psrlw) to shift the odd elements down to the even positions. Because everything is zero-extended, the unsigned-saturating packs truncate exactly; the signed packssdw/packsswb would clip values with the top bit set. That sets you up for 2x pack to create the two output vectors: 6 instructions total, plus several movdqa because all of those destroy their inputs (unlike pshufd, which is a copy-and-shuffle).

// don't use this, it's not optimal for any CPU
// deinterleaves 16-bit elements; needs SSE4.1 for packusdw (_mm_packus_epi32)
void unzip16_pack(__m128i &a, __m128i &b) {
    const __m128i lo16_mask = _mm_set1_epi32(0x0000FFFF);
    __m128i a_even = _mm_and_si128(a, lo16_mask);   // even words, zero-extended to 32 bits
    __m128i a_odd  = _mm_srli_epi32(a, 16);         // odd words moved down, zero-extended
    __m128i b_even = _mm_and_si128(b, lo16_mask);
    __m128i b_odd  = _mm_srli_epi32(b, 16);
    // all values are in [0, 0xFFFF], so unsigned saturation is a no-op (pure truncation)
    __m128i lo = _mm_packus_epi32(a_even, b_even);  // (A0 A2 A4 A6 B0 B2 B4 B6)
    __m128i hi = _mm_packus_epi32(a_odd,  b_odd);   // (A1 A3 A5 A7 B1 B3 B5 B7)
    a = lo;
    b = hi;
}

Nehalem is the only CPU where it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2 per clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b would only need one extra movdqa after that to get the punpckldq and punpckhdq results into separate registers. (movdqa isn't free; it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than on overall front-end bandwidth (uop throughput) or something.)

I very much recommend just using 2x shufps. It will be good on the average CPU, and not horrible anywhere.


AVX512

AVX512 introduced a lane-crossing pack-with-truncation instruction that narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, instead of only by a factor of 2.

For this case, __m256i _mm512_cvtepi64_epi32 (__m512i a) (vpmovqd) will take the even 32-bit elements from a vector and pack them together (i.e. the low halves of each 64-bit element). It's still not a good building block for the 2-input unzip here, though, since you need something else to get the odd elements into place.
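
For example (a sketch, AVX512F): the even elements come out directly, and a logical right shift of each qword by 32 exposes the odd ones:

#include <immintrin.h>

// even/odd 32-bit elements of one 512-bit vector via vpmovqd (sketch, AVX512F)
static inline __m256i even32(__m512i v) { return _mm512_cvtepi64_epi32(v); }                        // low dword of each qword
static inline __m256i odd32 (__m512i v) { return _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32)); } // high dword of each qword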

It also comes in signed/unsigned saturation versions. The instructions even have a memory-destination form that the intrinsics expose to let you do a masked-store.

But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just two shuffles: vpermi2d/vpermt2d.
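
A sketch of that AVX512 2-shuffle unzip for 32-bit elements (index values 0-15 select from the first source, 16-31 from the second; the function name is mine):

#include <immintrin.h>

// full-width unzip of 32-bit elements with two vpermt2d shuffles (sketch, AVX512F)
static inline void unzip32_avx512(__m512i a, __m512i b, __m512i *even, __m512i *odd) {
    const __m512i idx_even = _mm512_setr_epi32( 0, 2, 4, 6, 8, 10, 12, 14,
                                               16, 18, 20, 22, 24, 26, 28, 30);
    const __m512i idx_odd  = _mm512_setr_epi32( 1, 3, 5, 7, 9, 11, 13, 15,
                                               17, 19, 21, 23, 25, 27, 29, 31);
    *even = _mm512_permutex2var_epi32(a, idx_even, b); // (A0 A2 ... A14 B0 B2 ... B14)
    *odd  = _mm512_permutex2var_epi32(a, idx_odd,  b); // (A1 A3 ... A15 B1 B3 ... B15)
}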

Peter Cordes
  • Thanks a lot, great answer! `_mm_shuffle_ps` seems to be the optimal solution for 32-bit types. My solution (`shuffle` + `unpack`) would also work for smaller types (requiring 2 operations for each output), but has the disadvantage that it needs `_mm_shuffle_epi8` where the shuffle mask comes from a vector, not from an immediate. This makes it inefficient if you want to encapsulate it in an unzip function. Any idea on this? – Ralf Jul 30 '17 at 12:50
  • @Ralf: Compilers will hoist `_mm_set_epi8(...)` constants out of loops after inlining. Just write it "naively", with `_mm_shuffle_epi8(v, _mm_set_epi8(...));`, unless you're using MSVC (which I think would fail to hoist the constants after inlining). Vector constants are like string literals: multiple functions that use the same constant end up sharing an actual definition. Making a `static const __m128i` will actually be worse. – Peter Cordes Jul 30 '17 at 12:53
  • I'm one of those frustrated by AVX, just like you mentioned. Compared to AVX/2, NEON is a godsend. Having worked on AVX2 after NEON, I understand why Intel totally failed in the mobile sector. Their instruction set simply SUCKS. – Jake 'Alquimista' LEE Oct 24 '17 at 14:04
  • Great answer, by the way. – Jake 'Alquimista' LEE Oct 24 '17 at 14:04
  • @Jake: The in-lane design of AVX / AVX2 is really horrible for a lot of stuff, but I guess it saves transistors. They went the other way with AVX512, with powerful shuffles that don't care about 128b lanes. But the smallest granularity versions of those decode to multiple uops on Skylake-AVX512 (`vpermt2w` and `vpermw`). Maybe they'll be 1c throughput on some future implementation, but for now the instruction-set is too expensive to fully implement in hardware. – Peter Cordes Oct 24 '17 at 21:21
  • @Jake: I don't know NEON very well at all, but it seems to be missing [a `pmovmskb` equivalent](https://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon) to get an integer bit-mask of a vector compare result. (Or the sign bit of every element if it's not a compare result). Using it to index a table of shuffle-control vectors [can do cool stuff](https://stackoverflow.com/q/31679341/224132). Integer <-> XMM is cheap (for a single integer) and low latency (3 cycles on Intel CPUs). Apparently that's a lot more expensive on ARM? – Peter Cordes Oct 24 '17 at 21:33
  • @PeterCordes What a coincidence. I did this in my most recent project, and it's doable with four instructions on NEON. Even though I really appreciated pmovmskb on AVX2, I'm worried about this moving to GPR from SIMD, thus potentially causing a pipeline stall. Moving to GPR from SIMD DOES cause a heavy hiccup on NEON. Could you tell me if that's the case on AVX2? – Jake 'Alquimista' LEE Oct 26 '17 at 05:45
  • BTW, the four-instructions method I mentioned above isn't that bad, considering that it doesn't involve GPRs and only one extra SIMD register is required. – Jake 'Alquimista' LEE Oct 26 '17 at 05:48
  • It's never a pipeline stall (out-of-order execution can always schedule around it), but it is highish latency on AMD Bulldozer-family, where two integer cores share a SIMD unit (something like 10 cycles). It's fine on other AMD designs as well. See http://agner.org/optimize/ for instruction tables and microarch guide. e.g. on Intel pre-Skylake, `movd eax, xmm0` is 3 cycle latency (1c throughput), and so is `movd xmm0, eax`. Same for `pmovmskb`: 3c latency, 1c throughput even for ymm vectors. SIMD -> integer domain has to be fast for FP compares that set flags: `ucomiss xmm0, xmm1` / `jp .nan` – Peter Cordes Oct 26 '17 at 05:51
  • @Jake'Alquimista'LEE: ... AVX512 even added instructions to *broadcast* GP registers to SIMD. https://hjlebbink.github.io/x86doc/html/VPBROADCASTB_W_D_Q.html. They're also efficient, on Skylake-AVX512: 1 ALU uop for port 5, with 3 cycle latency. Agner Fog doesn't list timings for them on KNL, but `movd` from GP to xmm is 5c latency and 0.5c (reciprocal) throughput. A lot of GP <-> SIMD is a throughput bottleneck just because it takes a lot of instructions to deal with all the elements, not because of any stalls on any microarchitectures. – Peter Cordes Oct 26 '17 at 05:58
  • @PeterCordes Thank you for the information. On ARM, pretty much every NEON instruction takes one cycle of execution time plus roughly 3~4 cycles of latency. And you have to take the in-order LITTLE cluster into account, but it's quite manageable by unrolling more deeply, which the sheer number of registers makes possible. I realized that AVX requires a lot more registers than NEON for all those shuffle-vectors, and that's why Intel failed so miserably in the mobile sector with their in-order Atoms IMO. – Jake 'Alquimista' LEE Oct 26 '17 at 06:00
  • @Jake'Alquimista'LEE: I doubt that's specifically why Intel didn't do well vs. ARM. More likely it's a combination of ecosystem and the extra cost of the "x86 tax" (harder to decode thanks to years of short-sighted ISA extensions and not cooperating with AMD, plus baseline x86 complexity). Atom doesn't even support AVX anyway, just SSE4. Anyway, yeah probably x86 is a worse ISA in general for in-order CPUs like Atom, especially 32-bit x86 with only 7 GP registers + stack pointer. Atom does have *very* fast store-forwarding (1c latency), but spill/reload still costs insns. – Peter Cordes Oct 26 '17 at 06:10
  • @PeterCordes There are pros and cons. I really appreciated some aspects of AVX2 such as pmovmskb, but AVX lacks CLZ, RBIT, UABD(unsigned absolute difference), and the absence of U8 arithmetic was the last straw that broke the camel's back for me. I needed all of them for my most recent project. Of course I found workarounds for everything, but it DID take extra instructions, and more importantly, extra registers that DOES hurt the performance even worse, especially on in-order architectures such as Atom. Overall, I'm really disappointed in AVX, and AVX512 isn't an option for many, if not most. – Jake 'Alquimista' LEE Oct 26 '17 at 06:20
  • @Jake'Alquimista'LEE: AVX512 adds a lot of things that it would have been *really* nice to have sooner, agreed that there are big holes even in SSE4 / AVX2 :/ AVX512 has a SIMD lzcnt. I think not a SIMD bit-reverse though. SSE2 has *sum* of unsigned absolute differences (`psadbw`), min/max and saturating +/- UB. SSSE3 has `pmaddubsw` which is pretty specialized. It depends what you need to do with a U8; some things are efficient in SSE, others aren't. I'm not trying to defend SSE2 / AVX2 as a great design, but with experience you learn some efficient workarounds for the missing stuff. – Peter Cordes Oct 26 '17 at 06:28
  • @PeterCordes It was a really constructive conversation. Thank you very much. Should you have questions about NEON, I'll do my best answering them. See you next time. PS: I didn't need the sum, but ABDs themselves. – Jake 'Alquimista' LEE Oct 26 '17 at 06:38
  • @Jake'Alquimista'LEE: Yeah, I figured you'd have found `psadbw` if you needed that. Nice chatting with you; I haven't done any ARM optimization myself, so I didn't know NEON instructions were usually such high latency. BTW, if you have time maybe you could post your `pmovmskb/ps` 4-instruction trick on https://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon, because the existing answers look clunky. There are probably some other questions where it would be useful (e.g. I think there was one about left-packing). – Peter Cordes Oct 26 '17 at 06:47