VZIP.32 is exactly what you are looking for.
from MSB to LSB:
q0: A4 | A3 | A2 | A1
q1: B4 | B3 | B2 | B1
vzip.32 q0, q1
q0: B2 | A2 | B1 | A1
q1: B4 | A4 | B3 | A3
On AArch64, it's quite different though.
from MSB to LSB:
v0: A4 | A3 | A2 | A1
v1: B4 | B3 | B2 | B1
zip2 v2.4s, v0.4s, v1.4s
zip1 v3.4s, v0.4s, v1.4s
v2: B4 | A4 | B3 | A3
v3: B2 | A2 | B1 | A1
And you shouldn't waste your time on intrinsics. My assembly version of a 4x4 matrix multiplication (float, complex) runs almost three times as fast as my "spoon-fed" intrinsics version compiled by Clang.
*The version compiled by GCC (7.1.1) is slightly faster than its Clang counterpart.
Below is the intrinsics version, using 32-bit integers as an example. It works on A32 NEON, AArch32, and AArch64.
uint32x4_t vecA, vecB;
...
uint32x4x2_t vecR = vzipq_u32(vecA, vecB);
uint32x4_t vecX = vecR.val[0]; // lower halves interleaved (zip1)
uint32x4_t vecY = vecR.val[1]; // upper halves interleaved (zip2)
Do note that zip1 combines the first (lower) halves while zip2 combines the second (upper) halves. Both results are returned together in the uint32x4x2_t, accessed through val[0] and val[1]. Once you access val[], the compiler can select the appropriate zip1 or zip2 instruction.