
Following my x86 question, I would like to know how to efficiently vectorize the following code on ARMv8:


#include <stdint.h>
#include <string.h>

static inline uint64_t Compress8x7bit(uint64_t x) {
  x = ((x & 0x7F007F007F007F00) >> 1) | (x & 0x007F007F007F007F);
  x = ((x & 0x3FFF00003FFF0000) >> 2) | (x & 0x00003FFF00003FFF);
  uint64_t res = ((x & 0x0FFFFFFF00000000) >> 4) | (x & 0x000000000FFFFFFF);
  
  /* does the following:
   uint64_t res = (x & 0xFF);
   for (unsigned i = 1; i <= 7; ++i) {
      x >>= 1;
      res |= (x & (0x7FUL << 7 * i));
   }
  */
  return res;
}

void ascii_pack2(const char* ascii, size_t len, uint8_t* bin) {
  uint64_t val;
  const char* end = ascii + len;

  while (ascii + 8 <= end) {
    memcpy(&val, ascii, 8);
    val = Compress8x7bit(val);
    memcpy(bin, &val, 8);
    bin += 7;
    ascii += 8;
  }

  // epilog - we do not pack since we have less than 8 bytes.
  while (ascii < end) {
    *bin++ = *ascii++;
  }
}
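
For reference, a direct NEON port of the same shift/OR steps (the plain shifts-and-ORs baseline discussed in the comments below) might look like the following untested sketch; the function name is made up, and the caller still has to store each 64-bit half as 7 bytes (e.g. with overlapping 8-byte stores):

#include <arm_neon.h>

// Untested sketch: the same three shift/OR steps as Compress8x7bit, applied to
// two 64-bit lanes at once. Each lane ends up with 7 packed bytes in its low 56 bits.
static inline uint8x16_t Compress8x7bit_neon(uint8x16_t chars)
{
    uint64x2_t x = vreinterpretq_u64_u8(chars);
    uint64x2_t t;
    t = vshrq_n_u64(vandq_u64(x, vdupq_n_u64(0x7F007F007F007F00ULL)), 1);
    x = vorrq_u64(t, vandq_u64(x, vdupq_n_u64(0x007F007F007F007FULL)));
    t = vshrq_n_u64(vandq_u64(x, vdupq_n_u64(0x3FFF00003FFF0000ULL)), 2);
    x = vorrq_u64(t, vandq_u64(x, vdupq_n_u64(0x00003FFF00003FFFULL)));
    t = vshrq_n_u64(vandq_u64(x, vdupq_n_u64(0x0FFFFFFF00000000ULL)), 4);
    x = vorrq_u64(t, vandq_u64(x, vdupq_n_u64(0x000000000FFFFFFFULL)));
    return vreinterpretq_u8_u64(x);
}
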
Roman
  • Do you have an attempt using intrinsics as a starting point? BTW, in the pure C version, the `memcpy` store is probably best done with an 8-byte `memcpy` so it can just be one unaligned `str` instruction. The next 8-byte store will overlap with it by 1, and that's fine. Adjust the loop condition accordingly to not write past the end, although it looks like you already check a conservative condition. Oh, I see, you don't even pack the tail since it would save less than 1 byte. Makes sense. – Peter Cordes Dec 19 '22 at 05:51
  • I used sse2neon and the SIMD implementation from x86 question - it gave me 50% improvement, nowhere close to what x86 gives me (x4-x5). – Roman Dec 19 '22 at 05:54
  • On your simple intrinsics implementation with just shifts and ORs (in the question), or (a 128-bit version of) chtz's answer using `_mm256_maddubs_epi16` and `_mm256_shuffle_epi8`? I'd expect the shift/OR to be not bad, although perhaps AArch64 SIMD has some tricks available that can do even better. – Peter Cordes Dec 19 '22 at 05:56
  • Shifts and ORs gain a 50% improvement; `_mm_maddubs_epi16` makes it slower. https://github.com/dragonflydb/dragonfly/blob/main/src/core/detail/bitpacking.cc#L164 so currently the committed version is the slower one. – Roman Dec 19 '22 at 05:59
  • You are testing repeated loops over a small enough buffer for some fast level of cache to work, right? AArch64 is probably pretty good at this even with scalar code, with efficient bit-pattern immediates for bitwise booleans like AND, and can combine shift+or into one scalar instruction. Even on x86-64, I'm a bit surprised you'd get a 5x speedup with just 128-bit SIMD. I guess the multiply trick does save a lot of instructions, though. – Peter Cordes Dec 19 '22 at 06:00
  • here is my testing code: https://github.com/dragonflydb/dragonfly/blob/main/src/core/compact_object_test.cc#L553 – Roman Dec 19 '22 at 06:01
  • Here is the godbolt link: https://godbolt.org/z/hr5hhbo8h I am not a low-level guy but I do not see any special optimizations there – Roman Dec 19 '22 at 06:04
  • https://developer.arm.com/documentation/102159/0400/Shifting-left-and-right shows the relevant AArch64 shift instructions, and an example of RGB565 to or from RGB888 unpacking / packing. `sri` (shift right and insert) is indeed useful. Your problem might be pretty similar since you don't need to move bits across wider element boundaries until the end. – Peter Cordes Dec 19 '22 at 06:49
  • https://arm-software.github.io/acle/neon_intrinsics/advsimd.html - I think `v = vsriq_n_u16(v, v, 1);` `v = vsriq_n_u32(v,v,2);` `v = vsriq_n_u64(v,v,4);` might do the trick for the first 3 steps. If I'm understanding the docs right about which bits it keeps from the non-shifted operand. I'm not sure I am. – Peter Cordes Dec 19 '22 at 08:08
  • Ok finally found decent documentation for sri: https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SRI--Shift-Right-and-Insert--immediate-- . No, the bits kept are only the ones where zeros were shifted in. So it would have to be `v>>8` then `sli` by #7, 2 shifts per step if doing it that way. So that's not ideal. – Peter Cordes Dec 19 '22 at 08:28
  • `USHL` - https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register-- - per-element variable-count shifts can shift left or right depending on the sign of the shift count. So first step can left shift the even elements by 1, joining into 14-bit groups in the middle of u16 elements. Next step can shift left+right into the middle of u32, etc. Then one final right-shift of a full u64, and byte shuffle. Also interesting was `uhadd`, but that would take an AND: `uhadd(v.4s, v.4s&0x00ff00ff..)` to right-shift the high halves by not self-adding – Peter Cordes Dec 19 '22 at 08:55
  • You should consider transposing the 8x8 matrix. Then you can right shift each row 0 to 7 (`vshr`), and left shift insert (`vsli`) next rows each. You will have a transposed 8x7 matrix that you can store lane by lane (`vst4_lane` / `vst3_lane`) – Jake 'Alquimista' LEE Dec 19 '22 at 11:22

3 Answers

3

ARM NEON / AArch64 SIMD has very nice variable-count shift instructions where a positive count is a left shift and a negative count is a right shift for that element. Unlike x86-64, it even has these for 8 and 16-bit elements. Specifically, ushl, unsigned shift left (see footnote 1).

That's quite handy for unpacking, letting us center the packed bits in a u64, so the top 4 bitfields are in the high 32 bits and the low 4 are in the low 32 bits. Then do the same thing centering within 32-bit elements, and so on. So it takes just one shift at each step, with no masking.

Unfortunately I didn't find a way to avoid a final AND. Since most of your loads from the binary data will be unaligned, we might as well avoid a shuffle by making all of them unaligned. But unfortunately that leaves 8 bits of high garbage at the top, one of which survives until the end. Shifting farther left to knock it off at any point would put lower bits in that element to the left of the element boundary for the next shift using narrower elements.

Untested, and I haven't played around much with AArch64 so I'm basing this on the docs. And I know very little about throughput of different asm choices on various AArch64 CPUs, like if ushl v,v,v can only run on one execution port on some CPUs. If this hits any major potholes, please let me know.

#include <arm_neon.h>

uint8x16_t ascii_unpack_a64(uint64x2_t v64)
{
    // v loaded from pBinary-1, so 8 characters are in each half.
    // Otherwise,  v = shuffle(v) to make that happen
    
    // hi   xHGFEDBCA | HGFEDBCAx   lo   // input value, where x is 8 bits of garbage.  (later comments: 1 bit per x)

    int64x2_t center_64 = {-4, 4};   // lane 0 = low half: shift right; lane 1 = high half: shift left
    uint32x4_t v32 = vreinterpretq_u32_u64(vshlq_u64(v64, center_64));  // xxxxHGFE|DBCA0000 | 0000HGFEDBCAxxxx
    // the 64-bit halves are now symmetric, except for where the non-zero garbage is
    int32x4_t center_32 = {-2, 2, -2, 2};
    uint16x8_t v16 = vreinterpretq_u16_u32(vshlq_u32(v32, center_32));  // xxHGFE00|00DBCA00 | 00HGFE00|00DBCAxx

    int16x8_t center_16 = {-1, 1, -1, 1, -1, 1, -1, 1};
    uint8x16_t v8 = vreinterpretq_u8_u16(vshlq_u16(v16, center_16));     // xHG0|0FE0 | 0DB0|0CA0 | 0HG0|0FE0 | 0DB0|0CAx
    int8x16_t shr_evens = vreinterpretq_s8_s16(vdupq_n_s16(0x00FF));  // repeat -1, 0: shift each even (low) byte right by 1
    v8 = vshlq_u8(v8, shr_evens);                                     // xH0G|0F0E | 0D0B|0C0A | 0H0G|0F0E | 0D0B|0C0A

    v8 = vandq_u8(v8, vdupq_n_u8(0x7F));  // Just because of one pesky bit that might not be zero :/
    return v8;
}

Godbolt

// GCC -O3 -Wall  -mcpu=neoverse-n2
ascii_unpack_a64(__Uint64x2_t):
        adrp    x0, .LC0
        movi    v2.8h, 0xff         // some constants can be materialized from immediates
        movi    v1.16b, 0x7f
        ldr     q5, [x0, #:lo12:.LC0]   // others it loads from .rodata
        adrp    x0, .LC1
        ldr     q4, [x0, #:lo12:.LC1]
        adrp    x0, .LC2
        ldr     q3, [x0, #:lo12:.LC2]
  // constant setup all done, the above part will get hoisted out of loops
        ushl    v0.2d, v0.2d, v5.2d
        ushl    v0.4s, v0.4s, v4.4s
        ushl    v0.8h, v0.8h, v3.8h
        ushl    v0.16b, v0.16b, v2.16b
        and     v0.16b, v0.16b, v1.16b
        ret

So that's 5 instructions per 16 characters, 4 of them shifts, not counting load and store. TODO: use an immediate bic bit-clear instead of and with a vector of repeated 0x7f bytes. Only one byte has any garbage, and it's at the top whatever the element size, so the mask could use any element width.
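
A minimal sketch of that TODO, as a drop-in for the final vandq_u8 line above (assuming the compiler folds the vbicq_u16 constant into a single bic-immediate instruction; ACLE has no dedicated bic-by-immediate intrinsic):

// Only bit 15 of the top u16 can be non-zero, so clearing bit 15 of every u16
// is equivalent to the AND with 0x7f.  Hopefully compiles to:  bic v0.8h, #0x80, lsl #8
v8 = vreinterpretq_u8_u16(vbicq_u16(vreinterpretq_u16_u8(v8), vdupq_n_u16(0x8000)));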

On Cortex-A76 for example (optimization guide), ushl v,v,v has 2 cycle latency, 1/clock throughput. (Regardless of 8-byte or 16-byte vector width.) Jake says some lower-end cores have half throughput for 16-byte vectors, in which case you might consider working in 8-byte chunks instead of 16-byte, avoiding a shuffle or having to load from before the start of the first element.

To balance back-end throughput better, you might have the 16-bit shift end up with the elements at the bottom of u16, instead of middle, like xxHG|00FE | 00DB|00CA. Then like in my x86-64 answer, 2x vand and 1x add to left-shift the high 7-bit field. The optimization manual strangely lists vand as 1/clock throughput, but says it can run on either ASIMD execution port. add has 2/clock throughput.
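
One possible reading of that, as an untested sketch (assumptions: the 16-bit stage left each 14-bit group at the bottom of its u16, low field in bits 6:0 and high field in bits 13:7; the hi7 name is made up and v16 reuses the name from the code above for that hypothetical intermediate). Isolating the high field and adding it back doubles it, i.e. shifts it left by one into bits 14:8, without disturbing the low field:

uint16x8_t hi7 = vandq_u16(v16, vdupq_n_u16(0x3F80));   // first vand: isolate the high 7-bit field
v16 = vaddq_u16(v16, hi7);                              // add: doubles it = left shift by 1 into bits 14:8
// the second vand is the usual final garbage-clearing mask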

uhadd unsigned halving add is also 2/clock throughput, but its purpose is average without overflow, so it won't knock off the high bit before right-shifting by 1. It takes the top 8 bits of the 9-bit sum in each element, so we still can't get away with just one AND + UHADD.

Cortex-A76 is just a random choice of an out-of-order pipeline from 2018, with two SIMD execution ports. IDK if ARM cloud servers like Graviton or Neoverse are similar, but I'm guessing they might be.

That's not counting load and store. Store-pair only costs one instruction per two contiguous vectors of 32 bytes, and the output character data can hopefully be aligned. If we do use offset-by-1 loads, that would rule out ldp load-pair. If ldp is efficient when aligned so two 14-byte chunks split into separate q vectors, that would mean we need to shuffle or byte-shift within those q vectors.

The A76 optimization manual says quad-word (16-byte) loads are less efficient when not aligned by 4. ptr-1 loads will always be misaligned; pointer-increment by 14 will be aligned by 4 every other vector. (Some of those will cross cache-line boundaries which is also a slowdown.) So you might consider using tbl or some other shuffle instead of purely unaligned loads, on microarchitectures like A76 where tbl is fast when used with 1 or 2 vectors (2/clock throughput). Two tbl instructions could grab the right 14-byte windows from a pair of 16-byte loads.

Or with one register of real data and another of zeros, tbl could shuffle and introduce zeros in the high byte of each u64, avoiding the final and. (And avoiding one of the vector shift constants by lining up the data so that a simple immediate shift count works for the first shift, v <<= 4;)
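
For example, an untested sketch of that idea (the helper name and index constants are made up, and it assumes reading a full 16 bytes starting at p is safe; bytes 14-15 are discarded anyway). vqtbl1q_u8 returns 0 for out-of-range indices, so it can both position the two 7-byte groups and plant known-zero bytes where ascii_unpack_a64 expects garbage:

static inline uint64x2_t load_packed14_tbl(const uint8_t *p)
{
    uint8x16_t raw = vld1q_u8(p);                          // bytes 0..13 = packed data for 16 chars
    const uint8x16_t idx = { 0xFF, 0, 1, 2, 3, 4, 5, 6,            // low half: zero byte, then bytes 0..6
                             7, 8, 9, 10, 11, 12, 13, 0xFF };      // high half: bytes 7..13, then zero
    return vreinterpretq_u64_u8(vqtbl1q_u8(raw, idx));     // out-of-range index -> 0, so no final AND needed
}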

I suspect a pack could cost a similar number of instructions, doing similar steps in the other order. If it's 5, that would be fewer instructions per byte than Jake's transpose idea (21 insn / 64B = 0.328 i/B. 5i/16B = 0.3125 i/B). But Jake is using 8-byte vectors so that costs more instructions. This isn't counting load or store instructions, and the transpose needs to do lots of small stores.

A76 is not fast at st3 or st4. For example, the "ASIMD store, 3 element, one lane, B/H" entry (st3) has 0.5/clock throughput and needs both the V (SIMD ALU) and L (load/store) pipelines, so it competes with the shuffle / shift work. The manual doesn't have complete details for st4: "ASIMD store, 4 element, one lane, B/H" is listed as 5 cycle latency but no throughput, on the V and L execution ports. The S (32-bit) element size is listed as 2/3 throughput, i.e. about 0.66 per cycle.


Footnote 1: There's also sshl, signed shift, but I don't know why it exists when you're not using a saturating or rounding version of it. Its Int(Elem[operand1, e, esize], unsigned) pseudocode says it also treats its elements as unsigned, unless that's a typo on ARM's web site. Apparently the shift-count vector is always treated as signed, so I'm guessing it does an arithmetic right shift for negative counts, despite the online instruction reference not mentioning it. If there's better documentation somewhere, it's dumb that it's not in the pages Google finds easily.

There's no ushr by register; with variable-count shifts, a positive count always means a left shift.
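
To illustrate the difference (hedged the same way as the footnote: assuming sshl sign-fills on a negative count while ushl zero-fills; variable names are made up), for some uint8x16_t v:

int8x16_t  minus3     = vdupq_n_s8(-3);                             // same negative counts for both
uint8x16_t logical    = vshlq_u8(v, minus3);                        // ushl: shifts in zeros
int8x16_t  arithmetic = vshlq_s8(vreinterpretq_s8_u8(v), minus3);   // sshl: shifts in copies of the sign bit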


68 cycles, 128 bytes per iteration, optimized for Cortex-A55

// written by Jake Lee
    .arch armv8-a
    .global ascii_pack_asm_rbshift_q
    .text

pBin    .req    x0
pAscii  .req    x1
len     .req    w2

.balign 64
.func
ascii_pack_asm_rbshift_q:
    adr     x7, 2f
    add     x6, pAscii, #64
    mov     x5, #96
    movi    v0.8h, #0x0001      // shift8
    ldp     q1, q2, [x7]        // shift16, shift32
    b       1f

.balign 32
2:
    .short  1, -1, 1, -1, 1, -1, 1, -1
    .long   2, -2, 2, -2

.balign 64
1:
    ld4     {v16.d-v19.d}[0], [pAscii], #32     // 4, 6 (4 cycles, 6 latency)
    ld4     {v16.d-v19.d}[1], [x6], #32
    ld4     {v20.d-v23.d}[0], [pAscii], x5
    ld4     {v20.d-v23.d}[1], [x6], x5
// 16
    ushl    v16.16b, v16.16b, v0.16b    // 1, 2
    ushl    v17.16b, v17.16b, v0.16b
    ushl    v18.16b, v18.16b, v0.16b
    ushl    v19.16b, v19.16b, v0.16b
        ushl    v16.8h, v16.8h, v1.8h   // hide the final ld4's latency of 6 cycles
        ushl    v17.8h, v17.8h, v1.8h
    ushl    v20.16b, v20.16b, v0.16b
    ushl    v21.16b, v21.16b, v0.16b
    ushl    v22.16b, v22.16b, v0.16b
    ushl    v23.16b, v23.16b, v0.16b

        ushl    v18.8h, v18.8h, v1.8h
        ushl    v19.8h, v19.8h, v1.8h
        ushl    v20.8h, v20.8h, v1.8h
        ushl    v21.8h, v21.8h, v1.8h
        ushl    v22.8h, v22.8h, v1.8h
        ushl    v23.8h, v23.8h, v1.8h

    ushl    v16.4s, v16.4s, v2.4s
    ushl    v17.4s, v17.4s, v2.4s
    ushl    v18.4s, v18.4s, v2.4s
    ushl    v19.4s, v19.4s, v2.4s
    ushl    v20.4s, v20.4s, v2.4s
    ushl    v21.4s, v21.4s, v2.4s
    ushl    v22.4s, v22.4s, v2.4s
    ushl    v23.4s, v23.4s, v2.4s
// 40

    ushr    v24.2d, v16.2d, #4      // 0.5, 2
    ushr    v17.2d, v17.2d, #4
    ushr    v18.2d, v18.2d, #4
    ushr    v19.2d, v19.2d, #4
    ushr    v20.2d, v20.2d, #4
    ushr    v21.2d, v21.2d, #4
    ushr    v22.2d, v22.2d, #4
    ushr    v23.2d, v23.2d, #4
// 44

    ushr    v25.2d, v17.2d, #8
    ushr    v26.2d, v18.2d, #16
    ushr    v27.2d, v19.2d, #24
    ushr    v28.2d, v20.2d, #32
    ushr    v29.2d, v21.2d, #40
    ushr    v30.2d, v22.2d, #48
// 47

    sli     v24.2d, v17.2d, #56     // 1, 2
    sli     v25.2d, v18.2d, #48
    sli     v26.2d, v19.2d, #40
    sli     v27.2d, v20.2d, #32
    sli     v28.2d, v21.2d, #24
    sli     v29.2d, v22.2d, #16
    sli     v30.2d, v23.2d, #8
    subs    len, len, #128
// 54

    st4     {v24.d-v27.d}[0], [pBin], #32   // 4
    st3     {v28.d-v30.d}[0], [pBin], #24   // 3
    st4     {v24.d-v27.d}[1], [pBin], #32
    st3     {v28.d-v30.d}[1], [pBin], #24
// 68
    b.gt    1b
.balign 16
    ret
.endfunc
Jake 'Alquimista' LEE
Peter Cordes
  • I did the bean counting, and your version takes 38 cycles in D-form (64 bytes/iteration) and 78 cycles in Q-form (128 bytes/iteration). I modified this to three register-based shifts followed by `ushr` by 4 for better performance. – Jake 'Alquimista' LEE Dec 26 '22 at 03:50
  • @Jake'Alquimista'LEE: With what cost model? Cortex-A76 for example has 1/clock throughput for `ushl` on 128-bit vectors, same for 64-bit. Neoverse / Graviton has 2/clock. I had CPUs like that in mind when optimizing this, as I said in the answer. (I'm being optimistic about unaligned loads not being a bottleneck...) Out-of-order exec can hopefully hide the latency chains, and unrolling can let you interleave two or more vectors so OoO exec doesn't have to work as hard. Obviously it's not carefully tuned for any particular CPU, though, especially not in-order pipelines. – Peter Cordes Dec 26 '22 at 04:22
  • Fortunately Aki's idea of splitting and re-combining bit-fields with two different shifts makes my idea in this answer mostly obsolete, assuming it can be adapted to unpack as well. I hadn't been considering ever shifting out any bits we want to keep, always keeping bitfields contiguous. – Peter Cordes Dec 26 '22 at 04:25
  • Sorry, it's 68 cycles in Q-form. (you win again). I improved your idea to `ushlq_u8` (by register), `ushlq_u16` (by register), `ushlq_u32` (by register), then `ushrq_u64` (by 4). We get exactly the same result as Aki's in 3.5 cycles instead of 4. As for the Q-form, I found out that `ld4/ld3/st4/st3` don't suffer any penalty when dealing with single 64-bit lanes. I could post the whole code if you want. – Jake 'Alquimista' LEE Dec 26 '22 at 04:49
  • I always optimize for in-order little cores such as `Cortex-a55` because the majority of chips come in big.LITTLE configurations. And code optimized for in-order little cores very rarely runs slower on out-of-order big cores. You never know what the OS's scheduler will do. – Jake 'Alquimista' LEE Dec 26 '22 at 05:32
  • @Jake'Alquimista'LEE: Sure, scheduling for in-order of course makes sense when targeting typical phones. That's not what I chose to do, and the OP didn't specify. (I mostly intended it as the outline of an idea, to be unrolled / scheduled as appropriate.) Hopefully my version is useful for someone targeting an AArch64 server, like AWS instances, or MacOS desktop/laptop, as even the "little" cores (IceStorm) have full-width 128-bit SIMD units (2/clock `ushl` by vector), and some degree of out-of-order exec. https://dougallj.github.io/applecpu/icestorm.html. – Peter Cordes Dec 26 '22 at 05:50
  • @Jake'Alquimista'LEE: I expect it would be useful to some future readers to post a full micro-optimized version, either as an edit to your own answer, a new section in *this* answer (feel free to edit mine), or maybe a new answer. Did you consider optimizing Aki's idea? It seems promising, fewer operations to get the whole thing done. – Peter Cordes Dec 26 '22 at 05:54
  • Done. Aki's idea is a good one, but it's on par with this in best case. – Jake 'Alquimista' LEE Dec 26 '22 at 06:25
3

With variable shifting the problem becomes quite simple:

          MSB                                                            LSB
 a0 = 0AAAAAAA'0bBBBBBB'0ccCCCCC'0dddDDDD'0eeeeEEE'0fffffFF'0ggggggG'0hhhhhhh
 a1 = AAAAAAA0'BBBBBB00'CCCCC000'DDDD0000'EEE00000'FF000000'G0000000'00000000 = a0 << {1,2,3,4,5,6,7,8}
 a2 = 00000000'0000000b'000000cc'00000ddd'0000eeee'000fffff'00gggggg'0hhhhhhh = a0 >> {7,6,5,4,3,2,1,0}
 a3 = 00000000'AAAAAAA0'BBBBBB00'CCCCC000'DDDD0000'EEE00000'FF000000'G0000000 = ext(a1, a1, 1);
 a4 = 00000000'AAAAAAAb'BBBBBBcc'CCCCCddd'DDDDeeee'EEEfffff'FFgggggg'Ghhhhhhh = a2 | a3

auto d1 = vshl_s8(d0, vcreate_s8(0x0102030405060708ull));
auto d2 = vshl_s8(d0, vcreate_s8(0xf9fafbfcfdfeff00ull));
auto d3 = vext_u8(d1,d1,1);
return vorr_u8(d2,d3);
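
Wrapped into a self-contained function (untested sketch; the pack8x7_aki name, the reinterpret casts and the memcpy-based 7-byte store are additions, not part of the snippet above), packing one 8-character group:

#include <arm_neon.h>
#include <stdint.h>
#include <string.h>

static inline void pack8x7_aki(uint8_t *bin, const char *ascii)
{
    int8x8_t d0 = vreinterpret_s8_u8(vld1_u8((const uint8_t *)ascii));
    int8x8_t d1 = vshl_s8(d0, vcreate_s8(0x0102030405060708ull));   // a1: per-byte left shifts 1..8 (MSB..LSB)
    int8x8_t d2 = vshl_s8(d0, vcreate_s8(0xf9fafbfcfdfeff00ull));   // a2: per-byte right shifts 7..0
    uint8x8_t d3 = vext_u8(vreinterpret_u8_s8(d1), vreinterpret_u8_s8(d1), 1);  // a3: a1 shifted right by 1 byte
    uint8x8_t a4 = vorr_u8(vreinterpret_u8_s8(d2), d3);             // a4: 7 packed bytes in the low 56 bits
    uint8_t tmp[8];
    vst1_u8(tmp, a4);
    memcpy(bin, tmp, 7);   // or store all 8 bytes and let the next group's store overlap by 1
}
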
Aki Suihkonen
  • Does that work? I thought `vshl_s8` would block propagation of bits across 8-bit boundaries, so e.g. a shift count of `7` for a byte would bring the low bit to the top, and the other 7 bits would get thrown away, not shifted into anything else. Or is this re-assembling the 7-bit fields from two shifts that together still have all the bits, from different sides of a byte boundary? That's what the `vorr` is doing? – Peter Cordes Dec 25 '22 at 11:46
  • The `vorr` just combines the bytes, as would `vadd`. All lanes are shifted both left and right, but we need the `vext` to shift all the lanes of a1 (or d1 as in source) right by 8 bits / 1 byte. – Aki Suihkonen Dec 25 '22 at 11:54
  • All four instructions don't dual issue in Q-form(4 cycles). I think Peter's version modified to three register based shifts plus one right shift by 4 is better since the shift by immediate does dual issue in Q-from(3.5 cycles). – Jake 'Alquimista' LEE Dec 26 '22 at 03:44
  • While this version has a shorter dependency chain, I couldn't make it operate on 16-byte vectors efficiently. The intermediate values are of the form `0abcdefg'0abcdefg`, while it should be e.g. `0abcdefg'abcdefg0`, followed by `ext q0,q0,1` and `vst1q_u8(dst, q0)`. On M1, `vst1_u8(dst, vget_low_u8(q0)); vst1_u8(dst + 7, vget_high_u8(q0))` is about 10% slower than Peter's. – Aki Suihkonen Dec 26 '22 at 06:29
  • I added the full assembly code in Peter's answer with all the cycle counting for `Cortex-A55` Fortunately, you can read assembly code. :-) – Jake 'Alquimista' LEE Dec 26 '22 at 06:44
1
#include <arm_neon.h>
#include <assert.h>
#include <stdint.h>

void ascii_pack_neon(uint8_t *pBin, uint8_t *pAscii, intptr_t len)
{
    assert(len >= 64);
    assert((len & 63) == 0);

    uint8x8x4_t ina, inb, outa;
    uint8x8x3_t outb;
    uint8x8_t row1, row2, row3, row4, row5, row6, row7;

    do {
        len -= 64;
        ina = vld4_u8(pAscii); pAscii += 32;
        inb = vld4_u8(pAscii); pAscii += 32;
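        // vld4 de-interleaves with stride 4, so ina.val[k] / inb.val[k] hold bytes k and k+4
        // of each 8-byte group; the uzp1/uzp2 pairs below complete the 8x8 byte transpose
        // (outa.val[0], row1 ... row7 end up holding bytes 0 ... 7 of all eight groups).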

        // finish transposing
        outa.val[0] = vuzp1_u8(ina.val[0], inb.val[0]);
        row1 = vuzp1_u8(ina.val[1], inb.val[1]);
        row2 = vuzp1_u8(ina.val[2], inb.val[2]);
        row3 = vuzp1_u8(ina.val[3], inb.val[3]);

        row4 = vuzp2_u8(ina.val[0], inb.val[0]);
        row5 = vuzp2_u8(ina.val[1], inb.val[1]);
        row6 = vuzp2_u8(ina.val[2], inb.val[2]);
        row7 = vuzp2_u8(ina.val[3], inb.val[3]);

        outa.val[1] = vshr_n_u8(row1, 1);
        outa.val[2] = vshr_n_u8(row2, 2);
        outa.val[3] = vshr_n_u8(row3, 3);

        outb.val[0] = vshr_n_u8(row4, 4);
        outb.val[1] = vshr_n_u8(row5, 5);
        outb.val[2] = vshr_n_u8(row6, 6);

        outa.val[0] = vsli_n_u8(outa.val[0], row1, 7);
        outa.val[1] = vsli_n_u8(outa.val[1], row2, 6);
        outa.val[2] = vsli_n_u8(outa.val[2], row3, 5);
        outa.val[3] = vsli_n_u8(outa.val[3], row4, 4);
        
        outb.val[0] = vsli_n_u8(outb.val[0], row5, 3);
        outb.val[1] = vsli_n_u8(outb.val[1], row6, 2);
        outb.val[2] = vsli_n_u8(outb.val[2], row7, 1);

        vst4_lane_u8(pBin, outa, 0); pBin += 4;
        vst3_lane_u8(pBin, outb, 0); pBin += 3;
        vst4_lane_u8(pBin, outa, 1); pBin += 4;
        vst3_lane_u8(pBin, outb, 1); pBin += 3;
        vst4_lane_u8(pBin, outa, 2); pBin += 4;
        vst3_lane_u8(pBin, outb, 2); pBin += 3;
        vst4_lane_u8(pBin, outa, 3); pBin += 4;
        vst3_lane_u8(pBin, outb, 3); pBin += 3;
        vst4_lane_u8(pBin, outa, 4); pBin += 4;
        vst3_lane_u8(pBin, outb, 4); pBin += 3;
        vst4_lane_u8(pBin, outa, 5); pBin += 4;
        vst3_lane_u8(pBin, outb, 5); pBin += 3;
        vst4_lane_u8(pBin, outa, 6); pBin += 4;
        vst3_lane_u8(pBin, outb, 6); pBin += 3;
        vst4_lane_u8(pBin, outa, 7); pBin += 4;
        vst3_lane_u8(pBin, outb, 7); pBin += 3;
    } while (len);
}

Below is the conventional version without transposing, which is much longer than the previous one:

static inline uint64x1_t pack8(uint64x1_t in)
{
    const uint64x1_t mask1 = vdup_n_u64(0x007f007f007f007f);
    const uint64x1_t mask2 = vdup_n_u64(0x00003fff00003fff);
    const uint64x1_t mask4 = vdup_n_u64(0x000000000fffffff);
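    // vbsl(mask, a, b) = (a & mask) | (b & ~mask): keep the bits that are already in place
    // where the mask is set, take the right-shifted copy elsewhere -- the same shift/AND/OR
    // steps as the scalar Compress8x7bit, one 64-bit lane at a time.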

    in = vbsl_u64(mask1, in, vshr_n_u64(in, 1));
    in = vbsl_u64(mask2, in, vshr_n_u64(in, 2));
    in = vbsl_u64(mask4, in, vshr_n_u64(in, 4));

    return in;
}


void ascii_pack_neon_conventional(uint8_t *pBin, uint8_t *pAscii, intptr_t len)
{
    // assert(len >= 64);
    // assert((len & 63) == 0);

    uint64x1x4_t ina, inb, outa;
    uint64x1x3_t outb;
    uint64x1_t row1, row2, row3, row4, row5, row6, row7;

    do {
        len -= 64;
        ina = vld1_u64_x4((uint64_t *)pAscii); pAscii += 32;
        inb = vld1_u64_x4((uint64_t *)pAscii); pAscii += 32;

        outa.val[0] = pack8(ina.val[0]);
        row1 = pack8(ina.val[1]);
        row2 = pack8(ina.val[2]);
        row3 = pack8(ina.val[3]);
        row4 = pack8(inb.val[0]);
        row5 = pack8(inb.val[1]);
        row6 = pack8(inb.val[2]);
        row7 = pack8(inb.val[3]);

        outa.val[1] = vshr_n_u64(row1, 8);
        outa.val[2] = vshr_n_u64(row2, 16);
        outa.val[3] = vshr_n_u64(row3, 24);
        outb.val[0] = vshr_n_u64(row4, 32);
        outb.val[1] = vshr_n_u64(row5, 40);
        outb.val[2] = vshr_n_u64(row6, 48);

        outa.val[0] = vsli_n_u64(outa.val[0], row1, 56);
        outa.val[1] = vsli_n_u64(outa.val[1], row2, 48);
        outa.val[2] = vsli_n_u64(outa.val[2], row3, 40);
        outa.val[3] = vsli_n_u64(outa.val[3], row4, 32);
        outb.val[0] = vsli_n_u64(outb.val[0], row5, 24);
        outb.val[1] = vsli_n_u64(outb.val[1], row6, 16);
        outb.val[2] = vsli_n_u64(outb.val[2], row7, 8);

        vst1_u64_x4((uint64_t *)pBin, outa); pBin += 32;
        vst1_u64_x3((uint64_t *)pBin, outb); pBin += 24;
    } while (len);
}

It seems that GCC is the culprit here: godbolt link (transposing version).
And GCC keeps being a disaster even in the conventional version.

Conclusion: ditch GCC. Use Clang instead, or better - write in assembly:

    .arch armv8-a
    .global ascii_pack_asm_transpose, ascii_pack_asm_conventional
    .text

pBin    .req    x0
pAscii  .req    x1
len     .req    w2


.balign 64
.func
ascii_pack_asm_transpose:
1:
    ld4     {v16.8b, v17.8b, v18.8b, v19.8b}, [pAscii], #32
    ld4     {v20.8b, v21.8b, v22.8b, v23.8b}, [pAscii], #32
    subs    len, len, #64

    uzp1    v0.8b, v16.8b, v20.8b
    uzp1    v24.8b, v17.8b, v21.8b
    uzp1    v25.8b, v18.8b, v22.8b
    uzp1    v26.8b, v19.8b, v23.8b
    uzp2    v27.8b, v16.8b, v20.8b
    uzp2    v28.8b, v17.8b, v21.8b
    uzp2    v29.8b, v18.8b, v22.8b
    uzp2    v30.8b, v19.8b, v23.8b

    ushr    v1.8b, v24.8b, #1
    ushr    v2.8b, v25.8b, #2
    ushr    v3.8b, v26.8b, #3
    ushr    v4.8b, v27.8b, #4
    ushr    v5.8b, v28.8b, #5
    ushr    v6.8b, v29.8b, #6

    sli     v0.8b, v24.8b, #7
    sli     v1.8b, v25.8b, #6
    sli     v2.8b, v26.8b, #5
    sli     v3.8b, v27.8b, #4
    sli     v4.8b, v28.8b, #3
    sli     v5.8b, v29.8b, #2
    sli     v6.8b, v30.8b, #1

    st4     {v0.b, v1.b, v2.b, v3.b}[0], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[0], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[1], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[1], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[2], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[2], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[3], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[3], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[4], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[4], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[5], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[5], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[6], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[6], [pBin], #3
    st4     {v0.b, v1.b, v2.b, v3.b}[7], [pBin], #4
    st3     {v4.b, v5.b, v6.b}[7], [pBin], #3
    b.gt    1b

.balign 16
    ret
.endfunc

/////////////////////////////////////////////////////////////

.balign 64
.func
ascii_pack_asm_conventional:
    adr     x3, 2f
    sub     pAscii, pAscii, #16
    sub     pBin, pBin, #8
    movi    v0.4h, #0x007f      // mask1
    ldp     d1, d2, [x3]        // mask2, mask4
    b       1f

.balign 16
2:
    .long   0x00003fff, 0x00003fff
    .long   0x0fffffff, 0x00000000

.balign 64
1:
    ldp     d16, d17, [pAscii, #16]
    ldp     d18, d19, [pAscii, #32]
    ldp     d20, d21, [pAscii, #48]
    ldp     d22, d23, [pAscii, #64]!
    subs    len, len, #64

    ushr    d24, d16, #1
    ushr    d25, d17, #1
    ushr    d26, d18, #1
    ushr    d27, d19, #1
    ushr    d28, d20, #1
    ushr    d29, d21, #1
    ushr    d30, d22, #1
    ushr    d31, d23, #1

    bif     v16.8b, v24.8b, v0.8b
    bif     v17.8b, v25.8b, v0.8b
    bif     v18.8b, v26.8b, v0.8b
    bif     v19.8b, v27.8b, v0.8b
    bif     v20.8b, v28.8b, v0.8b
    bif     v21.8b, v29.8b, v0.8b
    bif     v22.8b, v30.8b, v0.8b
    bif     v23.8b, v31.8b, v0.8b

    ushr    d24, d16, #2
    ushr    d25, d17, #2
    ushr    d26, d18, #2
    ushr    d27, d19, #2
    ushr    d28, d20, #2
    ushr    d29, d21, #2
    ushr    d30, d22, #2
    ushr    d31, d23, #2

    bif     v16.8b, v24.8b, v1.8b
    bif     v17.8b, v25.8b, v1.8b
    bif     v18.8b, v26.8b, v1.8b
    bif     v19.8b, v27.8b, v1.8b
    bif     v20.8b, v28.8b, v1.8b
    bif     v21.8b, v29.8b, v1.8b
    bif     v22.8b, v30.8b, v1.8b
    bif     v23.8b, v31.8b, v1.8b

    ushr    d24, d16, #4
    ushr    d25, d17, #4
    ushr    d26, d18, #4
    ushr    d27, d19, #4
    ushr    d28, d20, #4
    ushr    d29, d21, #4
    ushr    d30, d22, #4
    ushr    d31, d23, #4

    bif     v16.8b, v24.8b, v2.8b
    bif     v17.8b, v25.8b, v2.8b
    bif     v18.8b, v26.8b, v2.8b
    bif     v19.8b, v27.8b, v2.8b
    bif     v20.8b, v28.8b, v2.8b
    bif     v21.8b, v29.8b, v2.8b
    bif     v22.8b, v30.8b, v2.8b
    bif     v23.8b, v31.8b, v2.8b

    ushr    d24, d17, #8
    ushr    d25, d18, #16
    ushr    d26, d19, #24
    ushr    d27, d20, #32
    ushr    d28, d21, #40
    ushr    d29, d22, #48

    sli     d16, d17, #56
    sli     d24, d18, #48
    sli     d25, d19, #40
    sli     d26, d20, #32
    sli     d27, d21, #24
    sli     d28, d22, #16
    sli     d29, d23, #8

    stp     d16, d24, [pBin, #8]
    stp     d25, d26, [pBin, #24]
    stp     d27, d28, [pBin, #40]
    str     d29, [pBin, #56]!

    b.gt    1b

.balign 16
    ret
.endfunc

.end

Now you can see clearly that the transposing version is vastly superior, provided the chip doesn't mind unaligned stores much (most ARMv8-A ones don't).

You may ask why I don't use quad registers instead of double ones: on ARMv8, most instructions on quad registers have half the throughput of double ones. There is hardly any gain, if any, while being less flexible. This might be different on more advanced cores.

Jake 'Alquimista' LEE
  • Hi Jake, thanks and it looks very much impressive, but it was much worse than the naive, scalar solution in terms of performance :) – Roman Dec 19 '22 at 13:44
  • @Roman: What hardware did you test on? (And compiler version / options). Perhaps Jake was tuning for a micro-architecture that had a lot higher store throughput for small 4 and 3-byte stores (especially with `st4` and `st3` instructions), and can coalesce them in the store buffer to not bottleneck on commit to cache? – Peter Cordes Dec 19 '22 at 20:11
  • @Roman I think your target platform doesn't like that store part of mine. Nevertheless, neon can handle the conventional approach better than the arm integer core thanks to `vbsl` and `vsli` instruction. I'll post this another version soon. – Jake 'Alquimista' LEE Dec 20 '22 at 06:30
  • @PeterCordes I checked the disassembly and was shocked. I knew that arm compilers are bad, but not **this** bad..... – Jake 'Alquimista' LEE Dec 20 '22 at 07:45
  • OP didn't say what kind of AArch64 they're tuning for. To pick a random example, I checked the optimization manual for Cortex-A76; it has full throughput for `q` operand-size. (And quite low throughput for `st3` and `st4` stores, like 0.5 per clock for `st3` of one lane.) I added some numbers to my answer. Your "conventional" pack could probably do better by following the inverse pattern of my unpack, closing up the zeros between pairs of elements in the middle of a wider element by shifting right + left alternating. That would avoid most of the `bif`s. – Peter Cordes Dec 20 '22 at 20:51
  • I use graviton2 servers on AWS (r6g family). – Roman Dec 26 '22 at 07:25