
I'm working on a port from SSE to NEON using C intrinsics. I have two 128-bit blocks made of 32-bit words:

[A1  A2  A3  A4] [B1  B2  B3  B4]

and I need to gather them into two uint32x4_t like so:

[A1  B1  A2  B2] [A3  B3  A4  B4]

The 128-bit blocks and their associated stride are giving me trouble. I've reviewed ARM's NEON Part 1: Load and Stores, but I don't see anything that reaches across two 16-byte blocks.

How do I gather the data from the two 128-bit blocks?

jww
  • With SSE, the instructions you want are `punpckldq` and `punpckhdq` to interleave two vectors, not scalar insert! What SSE lacks is a deinterleave like ARM's `vuzp`. – Peter Cordes Dec 03 '17 at 21:17
  • @PeterCordes - Yeah... I need the SSE equivalent of ARM's `vuzp`. I actually asked the wrong question, but it was too late to change it once Jake provided an answer. – jww Dec 04 '17 at 09:44
  • Actually `shufps` can do the reverse: A1 and A2 from your first "output" vector (going into the bottom 64 bits of destination), and A3 and A4 from the 2nd (going into the top 64 bits). (And yes it's worth using an FP shuffle on integer data. Nehalem will have a bit of extra latency, but still good throughput.) SSE instructions only ever have one vector output operand. – Peter Cordes Dec 04 '17 at 09:59
  • Thanks Peter. Here's what I am trying to clean up: [Speck : 1173](https://github.com/weidai11/cryptopp/blob/master/speck-simd.cpp#L1173). Forgive my ignorance... Are you suggesting two `shufps` with a `por`, which means 3 SSE insns to produce a new vector (as opposed to 8 or so)? – jww Dec 04 '17 at 12:42
  • No, one `shufps([A1 B1 A2 B2], [A3 B3 A4 B4], _MM_SHUFFLE(2,0,2,0))` or `(3,1,3,1)` per result vector. That should have been obvious if you look at the manual for shufps and think about how you could use it... If you need to avoid destroying one of the inputs, you'll need to copy one first. – Peter Cordes Dec 04 '17 at 12:44
  • I just looked at the code you linked. Why would you ever write that? If you're just going to give up, store to arrays and use `_mm_set_epi32(blah blah)` to let the compiler do something less horrible than 16 shuffle uops + 16 integer<->xmm uops (8 each `_mm_insert` / `_mm_extract`). `insert`/`extract` are *more* expensive than `_mm_shuffle_epi32`. Since you're using `_mm_shuffle_epi8` anyway, you just need to get the right data into each vector in any order. (You could use different shuffle masks for the two vectors if necessary). You could have used pshufb / por – Peter Cordes Dec 04 '17 at 12:51
  • Thanks Peter. No one has given up. It's on the TODO list because SSE lacks the intrinsics and search is failing me. – jww Dec 04 '17 at 12:54
  • The code you linked is an example of (short term) giving up. The way it's written with intrinsics is maximally inefficient. I'm saying that nobody ever should have written that. (especially not when the next thing you do is `_mm_shuffle_epi8`! If you're shuffling elements one at a time, or with a flexible shuffle, combine it with any later shuffling.) Store and `_mm_setr_epi32(a[0], a[2], b[0], b[2])` (or whatever after you take the `_mm_shuffle_epi8` into account) would have been far easier to write and compiled at least as well. – Peter Cordes Dec 04 '17 at 13:14
  • Anyway, clang has a good shuffle optimizer. You can often give it inefficient shuffles and it figures out something good. (But it can sometimes pessimize carefully-chosen shuffles, see https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86) – Peter Cordes Dec 04 '17 at 13:17
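
For reference, here is a minimal SSE sketch of what the comments above describe: `_mm_unpacklo_epi32`/`_mm_unpackhi_epi32` (`punpckldq`/`punpckhdq`) do the interleave the question asks about, and one `shufps` per result does the reverse (deinterleave). The helper names are illustrative only, not taken from the linked code.

#include <emmintrin.h>   /* SSE2; also provides _mm_shuffle_ps via xmmintrin.h */

/* Interleave [A1 A2 A3 A4], [B1 B2 B3 B4] into [A1 B1 A2 B2], [A3 B3 A4 B4]. */
static inline void interleave_epi32(__m128i a, __m128i b, __m128i *x, __m128i *y)
{
    *x = _mm_unpacklo_epi32(a, b);   /* punpckldq */
    *y = _mm_unpackhi_epi32(a, b);   /* punpckhdq */
}

/* Deinterleave back with one shufps per result; the casts just let the
   FP shuffle run on integer data, as suggested in the comments. */
static inline void deinterleave_epi32(__m128i x, __m128i y, __m128i *a, __m128i *b)
{
    __m128 xf = _mm_castsi128_ps(x), yf = _mm_castsi128_ps(y);
    *a = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(2, 0, 2, 0)));   /* A1 A2 A3 A4 */
    *b = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(3, 1, 3, 1)));   /* B1 B2 B3 B4 */
}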

1 Answer


`VZIP.32` is exactly what you are looking for.

from MSB to LSB:
q0: A4 | A3 | A2 | A1
q1: B4 | B3 | B2 | B1

vzip.32 q0, q1

q0: B2 | A2 | B1 | A1
q1: B4 | A4 | B3 | A3

On aarch64, it's quite different though.

from MSB to LSB:
v0: A4 | A3 | A2 | A1
v1: B4 | B3 | B2 | B1

zip1 v2.4s, v0.4s, v1.4s
zip2 v3.4s, v0.4s, v1.4s

v2: B2 | A2 | B1 | A1
v3: B4 | A4 | B3 | A3
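
For completeness, the same AArch64 zip wrapped in GCC/Clang extended inline asm; the helper name and the constraint choices are my own sketch, not part of the original answer:

#include <arm_neon.h>

/* zip1/zip2 in inline asm, matching the listing above:
   *lo = [A1 B1 A2 B2], *hi = [A3 B3 A4 B4]. */
static inline void zip_u32(uint32x4_t a, uint32x4_t b, uint32x4_t *lo, uint32x4_t *hi)
{
    uint32x4_t l, h;
    __asm__("zip1 %0.4s, %2.4s, %3.4s \n\t"
            "zip2 %1.4s, %2.4s, %3.4s"
            : "=&w"(l), "=w"(h)   /* early-clobber: l is written while a and b are still needed */
            : "w"(a), "w"(b));
    *lo = l;
    *hi = h;
}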

And you shouldn't waste your time on intrinsics.

My assembly version of a 4x4 matrix multiplication (float, complex) runs almost three times as fast as my "spoon-fed" intrinsics version compiled by Clang.

*The GCC (7.1.1)-compiled intrinsics version is slightly faster than the Clang one, but not by much.


Below is the intrinsics version, using 32-bit integers as an example. It works on Armv7 NEON, AArch32, and AArch64.

uint32x4_t vecA, vecB;   // vecA = [A1 A2 A3 A4], vecB = [B1 B2 B3 B4]
...

uint32x4x2_t vecR = vzipq_u32(vecA, vecB);
uint32x4_t vecX = vecR.val[0];   // [A1 B1 A2 B2]
uint32x4_t vecY = vecR.val[1];   // [A3 B3 A4 B4]

Do note that zip1 combines the first (lower) halves of the two vectors while zip2 combines the second (upper) halves. The intrinsic returns both results in a uint32x4x2_t, accessed through val[0] (the zip1 result) and val[1] (the zip2 result); once you access val[], the compiler selects the zip1 or zip2 instruction (or vzip.32 on 32-bit targets) as needed.
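
As a quick sanity check, a minimal self-contained program (the values are made up for illustration) that prints both zipped vectors:

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    /* A = [A1 A2 A3 A4] as 1..4, B = [B1 B2 B3 B4] as 11..14 */
    const uint32_t a[4] = { 1, 2, 3, 4 };
    const uint32_t b[4] = { 11, 12, 13, 14 };

    uint32x4x2_t r = vzipq_u32(vld1q_u32(a), vld1q_u32(b));

    uint32_t lo[4], hi[4];
    vst1q_u32(lo, r.val[0]);   /* expect 1 11 2 12  (A1 B1 A2 B2) */
    vst1q_u32(hi, r.val[1]);   /* expect 3 13 4 14  (A3 B3 A4 B4) */

    printf("%u %u %u %u\n", (unsigned)lo[0], (unsigned)lo[1], (unsigned)lo[2], (unsigned)lo[3]);
    printf("%u %u %u %u\n", (unsigned)hi[0], (unsigned)hi[1], (unsigned)hi[2], (unsigned)hi[3]);
    return 0;
}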

jww
Jake 'Alquimista' LEE
  • @jww As a matter of fact, the intrinsics version wasn't much faster than the plain C version in my test run, and that pretty much defeats the point of using NEON to start with. I hope you know what you are doing. 500k iterations: C: 196ms, Intrinsics: 152ms, asm: 60ms. It's anything but 0.1cpb. Good luck. – Jake 'Alquimista' LEE Dec 03 '17 at 14:09
  • @jww I just ran a benchmark on a Galaxy S7 (aarch64, out-of-order), and it's 50ms vs 3ms. Unfortunately, I don't have an A53 test board. Anyway, good luck again. – Jake 'Alquimista' LEE Dec 03 '17 at 17:44
  • @jww Oops, sorry. It's 52ms vs 29ms – Jake 'Alquimista' LEE Dec 03 '17 at 19:12
  • @jww Done. Would you please check if zip1 and zip2 work the way ARM's document says? http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100069_0607_00_en/pge1425910926465.html – Jake 'Alquimista' LEE Dec 05 '17 at 11:31
  • By the way, here is why I try to stay firm on intrinsics: [Microsoft Showcases Qualcomm ARM-Based Windows 10 PCs Coming Next Year](https://redmondmag.com/articles/2017/12/06/qualcomm-arm-windows-10-pcs.aspx). We support Windows Phones, Windows tablets and upcoming platforms like the ARM Desktop. If we switch to GAS assembly then we lose the higher performance for Windows and ARM. We lose it because Microsoft does not provide an inline ARM assembler, and they don't document their standalone ARM assembler. – jww Dec 08 '17 at 09:52
  • `vget_high` and `vget_low` are legacy stuff from ARMv7, where each quad (128-bit) register was mapped to two double (64-bit) registers (q0 = d0 and d1, q1 = d2 and d3, ... q15 = d30 and d31). On `aarch64` however, the registers are mapped one to one: d0 = lower half of v0, d1 = lower half of v1, ... d31 = lower half of v31. Another reason against intrinsics: if you aren't aware of this legacy stuff, you could end up severely harming the performance. Believe me: if your problem requires roughly more than ten registers, the compiler starts generating FUBAR machine code. – Jake 'Alquimista' LEE Dec 08 '17 at 17:58
  • I'll be writing a blog post on this. Can you recommend a good meta site for this purpose? – Jake 'Alquimista' LEE Dec 08 '17 at 17:59
  • Sorry, no recommendations on blogs. We were getting a bad interaction with GCC and Aarch32/Aarch64 table based rotates using intrinsics. The intrinsics were producing a bad result for some reason, and we had to disable the table based rotates for the SPECK-64 cipher. The SIMON-64 cipher, which used nearly the same code (only the round function was different), worked fine. It is a mystery to me why only SPECK-64 had troubles. Also see [`WORKAROUND_GCC_AARCH64_BUG`](https://github.com/weidai11/cryptopp/blob/master/speck-simd.cpp). – jww Dec 10 '17 at 02:43
  • @jww Welcome to the wonderful world of intrinsics where you just have to accept what you get, where even the slightest differences make huge negative impacts for unknown reasons. My question is: Why beg for something if you can order it to get done? Shall I assist you with my assembly skills? Which function in particular are you trying to optimize? – Jake 'Alquimista' LEE Dec 10 '17 at 06:09
  • @jww btw, you should amend the `unpack64`(low/high) to `vtrn1` and `vtrn2`. NEON has much better permutation instructions than SSE/AVX. – Jake 'Alquimista' LEE Dec 10 '17 at 13:45
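
A minimal sketch of the trn1/trn2 idea from the last comment, assuming the 64-bit unpack lives in a standalone helper (the name unpack64 here is hypothetical, not the function from the linked repository); these are AArch64-only intrinsics:

#include <arm_neon.h>

/* AArch64 only: on 64-bit lanes, trn1/trn2 behave like SSE's
   punpcklqdq/punpckhqdq, i.e. an unpack-low/high of the 64-bit halves. */
static inline void unpack64(uint64x2_t a, uint64x2_t b, uint64x2_t *lo, uint64x2_t *hi)
{
    *lo = vtrn1q_u64(a, b);   /* [ a.low,  b.low  ] */
    *hi = vtrn2q_u64(a, b);   /* [ a.high, b.high ] */
}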