
I am working on a task that converts a large binary label image, stored as one uint8_t per pixel where each pixel is either 0 or 1 (or 255), into an array of uint64_t values in which each bit represents one label pixel.

For example,

input array: 0 1 1 0 ... (00000000 00000001 00000001 00000000 ...)

or input array: 0 255 255 0 ... (00000000 11111111 11111111 00000000 ...)

output array (number): 6 (because after converting each uint8_t to a single bit, the pixels become binary 0110, i.e. 6)

Currently the C code to achieve this is:

    for (int j = 0; j < width >> 6; j++) {
        uint8_t *in_ptr = in + (j << 6);
        uint64_t out_bits = 0;
        if (in_ptr[0]) out_bits |= 0x0000000000000001;
        if (in_ptr[1]) out_bits |= 0x0000000000000002;
        ...
        if (in_ptr[63]) out_bits |= 0x8000000000000000;
        *output = out_bits; output++;
    }

Can ARM NEON optimize this functionality? Please help. Thank you!

debug_all_the_time
  • You could `vmvnq_u8(vceqzq_u8(input))` to get either all zeros or all ones, then `vandq_u8` with a vector of 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 to get bits set in the right place. Then a series of `vhaddq` until you get to a pair of 64-bit values. Left shift each 64-bit element by 0, 16, 32, or 48 bits (depending on the position), then bitwise OR them together to get the final bitmask. – nemequ Jan 19 '22 at 05:13
  • 32-bit or 64-bit ARM? – Nate Eldredge Jan 19 '22 at 06:00
  • @NateEldredge Right now I can only use 32-bit arm and maybe will support 64-bit ARM later... So it will be appreciated if both hints/solutions can be provided. :) – debug_all_the_time Jan 19 '22 at 06:36
  • ARM64 can at least do 8 input bytes at a time by doing USHL by a vector of `{0,1,2,3,4,5,6,7}`, then ADDV to reduce to one byte. Probably can do better, but it's a start. – Nate Eldredge Jan 19 '22 at 07:09
  • Wait, do you get to decide whether an "on" pixel has value 1 or 255, or do you have to handle both possibilities? (I guess in the latter case, you can AND with `{1,1,1,1,1,1,1,1}` first, reducing to the 1 case.) – Nate Eldredge Jan 19 '22 at 07:14
  • @NateEldredge Thanks for the advice. I haven't decided whether to use 1 or 255; it depends on which is more convenient for the NEON conversion. I don't need to handle both cases :) – debug_all_the_time Jan 19 '22 at 17:02
  • @nemequ Thank you for your suggestion. I will try your idea to see how it works! :) – debug_all_the_time Jan 19 '22 at 17:04
  • 1
  • If you have a choice, use 255, not 1. That eliminates the need to compare to zero and then flip the bits (`vmvnq_u8(vceqzq_u8(input))`). For SIMD you pretty much always want to use all bits set or all bits unset. – nemequ Jan 19 '22 at 21:26
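For illustration, here is a minimal AArch64 sketch of the per-lane-shift idea from Nate Eldredge's comments above (the helper name pack8_bits is invented here, it assumes "on" pixels are exactly 1, and vaddv_u8 requires a 64-bit target):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Hypothetical helper: pack 8 pixels (each 0 or 1) into one byte, bit i = pixel i.
     * USHL by {0..7} moves each pixel to its bit position, ADDV sums the lanes
     * (no carries, since every bit position occurs exactly once). */
    static inline uint8_t pack8_bits(const uint8_t *px)
    {
        const int8x8_t shifts = {0, 1, 2, 3, 4, 5, 6, 7};
        uint8x8_t v = vld1_u8(px);   /* 8 pixels, each 0 or 1        */
        v = vshl_u8(v, shifts);      /* pixel i -> bit i of its byte */
        return vaddv_u8(v);          /* horizontal add = packed byte */
    }

Eight such bytes, stored in order, form one uint64_t of the output on a little-endian target.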

3 Answers


Assuming the input values are either 0 or 255, below is the basic version, which is rather straightforward, especially for people with Intel SSE/AVX experience (note that `vpaddq_u8` is A64-only; on 32-bit ARM the pairwise adds would have to be done with `vpadd_u8` on 64-bit halves).

#include <arm_neon.h>
#include <stdint.h>

// packs length pixels (each 0x00 or 0xFF) into length/8 bytes, 8 pixels per byte
void foo_basic(uint8_t *pDst, uint8_t *pSrc, intptr_t length)
{
    //assert(length >= 64);
    //assert((length & 7) == 0);
    uint8x16_t in0, in1, in2, in3;
    uint8x8_t out;
    const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

    length -= 64;

    do {
        do {
            in0 = vld1q_u8(pSrc); pSrc += 16;
            in1 = vld1q_u8(pSrc); pSrc += 16;
            in2 = vld1q_u8(pSrc); pSrc += 16;
            in3 = vld1q_u8(pSrc); pSrc += 16;

            // keep one bit per byte, already at its final position
            in0 &= mask;
            in1 &= mask;
            in2 &= mask;
            in3 &= mask;

            // pairwise adds merge neighbouring bytes; no carries occur because
            // every bit position appears exactly once per group of eight bytes
            in0 = vpaddq_u8(in0, in1);
            in2 = vpaddq_u8(in2, in3);

            in0 = vpaddq_u8(in0, in2);

            out = vpadd_u8(vget_low_u8(in0), vget_high_u8(in0));

            vst1_u8(pDst, out); pDst += 8;

            length -= 64;
        } while (length >= 0);

        // step back and reprocess the final, possibly overlapping 64-pixel chunk
        pSrc += length;
        pDst += length >> 3;
    } while (length > -64);
}
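If the output really has to be an array of uint64_t words as in the question, a thin wrapper is enough, since the 8 bytes written per 64 pixels are exactly the bytes of one word on a little-endian target (the wrapper name pack_row is mine, not part of the answer; width is assumed to be a multiple of 64):

    #include <stdint.h>

    /* Hypothetical wrapper: width pixels in, width/64 uint64_t words out.
     * Relies on little-endian byte order so that the 8 packed bytes per
     * 64 pixels land in one uint64_t with pixel 0 in bit 0. */
    void pack_row(uint64_t *out64, uint8_t *in, intptr_t width)
    {
        foo_basic((uint8_t *)out64, in, width);
    }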

Neon, however, has very user-friendly and efficient permutation and bit-field instructions that allow you to go "vertical":

void foo_advanced(uint8_t *pDst, uint8_t *pSrc, intptr_t length)
{
    //assert(length >= 128);
    //assert((length & 7) == 0);
    uint8x16x4_t in0, in1;
    uint8x16x2_t row04, row15, row26, row37;

    length -= 128;

    do {
        do {
            in0 = vld4q_u8(pSrc); pSrc += 64;
            in1 = vld4q_u8(pSrc); pSrc += 64;

            // de-interleave so that each register holds every 8th pixel
            row04 = vuzpq_u8(in0.val[0], in1.val[0]);
            row15 = vuzpq_u8(in0.val[1], in1.val[1]);
            row26 = vuzpq_u8(in0.val[2], in1.val[2]);
            row37 = vuzpq_u8(in0.val[3], in1.val[3]);

            // shift-left-and-insert pyramid: merge 8 registers into 1, one bit per pixel
            row04.val[0] = vsliq_n_u8(row04.val[0], row15.val[0], 1);
            row26.val[0] = vsliq_n_u8(row26.val[0], row37.val[0], 1);
            row04.val[1] = vsliq_n_u8(row04.val[1], row15.val[1], 1);
            row26.val[1] = vsliq_n_u8(row26.val[1], row37.val[1], 1);

            row04.val[0] = vsliq_n_u8(row04.val[0], row26.val[0], 2);
            row04.val[1] = vsliq_n_u8(row04.val[1], row26.val[1], 2);

            row04.val[0] = vsliq_n_u8(row04.val[0], row04.val[1], 4);

            vst1q_u8(pDst, row04.val[0]); pDst += 16;

            length -= 128;
        } while (length >= 0);

        // step back and reprocess the final, possibly overlapping 128-pixel chunk
        pSrc += length;
        pDst += length >> 3;
    } while (length > -128);
}
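To see what the vsli pyramid computes, here is a scalar equivalent for one output byte (my own illustration, not part of the original answer; if I read vsli correctly, only bit 0 of each source byte survives, so this works for the 1 labelling as well as the 255 one):

    #include <stdint.h>

    /* Illustration only: scalar equivalent of one output byte of the vsli pyramid.
     * Only bit 0 of each of the eight consecutive source bytes reaches the result. */
    static uint8_t pack_one_byte_scalar(const uint8_t *pSrc, int i)
    {
        uint8_t packed = 0;
        for (int k = 0; k < 8; k++)
            packed |= (uint8_t)((pSrc[8 * i + k] & 1u) << k);
        return packed;
    }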

The Neon-only advanced version is shorter and faster, but GCC is extremely bad at dealing with Neon-specific permutation instructions such as vtrn, vzip, and vuzp.

https://godbolt.org/z/bGdbohqKe

Clang isn't any better: it spams unnecessary vorr where GCC does the same with vmov. Below is a hand-written assembly version for 32-bit ARM:

    .syntax unified
    .arm
    .arch   armv7-a
    .fpu    neon
    .global foo_asm
    .text

.func
.balign 64
foo_asm:
    sub     r2, r2, #128

.balign 16
1:
    vld4.8      {d16, d18, d20, d22}, [r1]!
    vld4.8      {d17, d19, d21, d23}, [r1]!
    vld4.8      {d24, d26, d28, d30}, [r1]!
    vld4.8      {d25, d27, d29, d31}, [r1]!
    subs    r2, r2, #128

    vuzp.8      q8, q12
    vuzp.8      q9, q13
    vuzp.8      q10, q14
    vuzp.8      q11, q15

    vsli.8      q8, q9, #1
    vsli.8      q10, q11, #1
    vsli.8      q12, q13, #1
    vsli.8      q14, q15, #1

    vsli.8      q8, q10, #2
    vsli.8      q12, q14, #2

    vsli.8      q8, q12, #4

    vst1.8      {q8}, [r0]!
    bpl     1b

    add     r1, r1, r2
    cmp     r2, #-128
    add     r0, r0, r2, asr #3

    bgt     1b
.balign 8
    bx      lr

.endfunc
.end

The innermost loop consists of:
  • GCC: 32 instructions
  • Clang: 30 instructions
  • hand-written asm: 18 instructions

It doesn't take rocket science to figure out which one is the fastest and by how much: Never trust compilers if you are about to do permutations.

Jake 'Alquimista' LEE

Standing on the shoulders of Jake 'Alquimista' LEE, we can improve on the unzipping and on the algorithm as well by changing the order of the uzp and vsli operations:

#include <arm_neon.h>
#include <stdint.h>

#define interleave_nibbles(top) \
    top.val[0] = vsliq_n_u8(top.val[0], top.val[1], 1);\
    top.val[2] = vsliq_n_u8(top.val[2], top.val[3], 1);\
    top.val[0] = vsliq_n_u8(top.val[0], top.val[2], 2);

// packs one block of 128 pixels (0 or 255) into 16 bytes
void transpose_bits(uint8_t const *src, uint8_t *dst) {
    uint8x16x4_t top = vld4q_u8(src);
    uint8x16x4_t bot = vld4q_u8(src + 64); src += 128;
    interleave_nibbles(top);
    interleave_nibbles(bot);
    // now we have 4 bits correct in each of the 32 bytes left
    // top = 0to3 4to7 8to11 12to15 ...
    // bot = 64to67 68to71 ...
    uint8x16x2_t top_bot = vuzpq_u8(top.val[0], bot.val[0]);
    uint8x16_t result = vsliq_n_u8(top_bot.val[0], top_bot.val[1], 4);
    vst1q_u8(dst, result); dst += 16;
}
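A possible driver (my addition, not part of the answer) just calls this once per 128-pixel block, assuming the length is a multiple of 128:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical driver: pack a whole row whose length is a multiple of 128 pixels. */
    void pack_row_128(const uint8_t *src, uint8_t *dst, size_t length)
    {
        for (size_t i = 0; i < length; i += 128)
            transpose_bits(src + i, dst + (i >> 3));
    }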

The assembly clang produces for transpose_bits now has only two extraneous movs (done as vorr), and the gcc output has four movs.

    vld4.8  {d16, d18, d20, d22}, [r0]!
    vld4.8  {d17, d19, d21, d23}, [r0]!
    vld4.8  {d24, d26, d28, d30}, [r0]!
    vsli.8  q10, q11, #1
    vorr    q0, q8, q8
    vld4.8  {d25, d27, d29, d31}, [r0]
    vsli.8  q0, q9, #1
    vorr    q2, q14, q14
    vsli.8  q12, q13, #1
    vsli.8  q2, q15, #1
    vsli.8  q0, q10, #2
    vsli.8  q12, q2, #2
    vuzp.8  q0, q12
    vsli.8  q0, q12, #4
    vst1.8  {d0, d1}, [r1]

And the arm64 version looks perfect with only 12 instructions.

    ld4     { v0.16b, v1.16b, v2.16b, v3.16b }, [x0], #64
    ld4     { v4.16b, v5.16b, v6.16b, v7.16b }, [x0]
    sli     v0.16b, v1.16b, #1
    sli     v2.16b, v3.16b, #1
    sli     v0.16b, v2.16b, #2
    sli     v4.16b, v5.16b, #1
    sli     v6.16b, v7.16b, #1
    sli     v4.16b, v6.16b, #2
    uzp1    v16.16b, v0.16b, v4.16b
    uzp2    v0.16b, v0.16b, v4.16b
    sli     v16.16b, v0.16b, #4
    str     q16, [x1]
Aki Suihkonen

You can do it more efficiently (especially for short arrays or single vectors) using something like this (in this example, turning one 128-bit register into one 16-bit mask):

        // turn mask of bytes in v0 into mask of bits in w0
movmsk: adr     x0, 0f                  // obtain address of literal
        ld1r    {v1.2d}, [x0]           // load 80..01 mask twice into v1
        and     v0.16b, v0.16b, v1.16b  // mask bytes from ff to single bits
        mov     d1, v0.d[1]             // extract high 64 bit
        zip1    v0.16b, v0.16b, v1.16b  // interleave low and high halves, byte by byte
        addv    h0, v0.8h               // sum into bit mask
        mov     w0, v0.s[0]             // move result to general register
        ret
0:      .quad   0x8040201008040201

The idea is to turn the contents of each byte into just one bit, already at the bit position it is going to end up at, and then to sum up the bits using addv (8 bytes at a time, each group of 8 contributing one byte of the output mask).

Putting a loop around this code to have it traverse the entire array is left as an exercise to the reader.
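For reference, here is a rough intrinsics rendering of the same masking-and-horizontal-add idea (my sketch, not fuz's code; it uses two vaddv_u8 reductions instead of the zip/addv trick, neon_movemask_u8 is an invented name, and vaddv_u8 requires AArch64):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch: collapse one 16-byte mask (0x00 or 0xFF per byte) into a 16-bit mask. */
    static inline uint16_t neon_movemask_u8(uint8x16_t in)
    {
        /* keep one bit per byte, already at its final position */
        const uint8x16_t bits = vreinterpretq_u8_u64(vdupq_n_u64(0x8040201008040201ULL));
        uint8x16_t masked = vandq_u8(in, bits);

        /* horizontal add of each half; no carries, every bit position occurs once */
        unsigned lo = vaddv_u8(vget_low_u8(masked));
        unsigned hi = vaddv_u8(vget_high_u8(masked));
        return (uint16_t)(lo | (hi << 8));
    }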

fuz