NEON can quickly narrow a 128-bit comparison byte mask to 64 bits, using either 'shift right and narrow' (SHRN) or 'pack using signed saturation' (VQMOVN/SQXTN).
This allows the mask to be extracted to a 64-bit general-purpose register on __aarch64__.
Once extracted, the mask can be checked for all-zeros or all-ones (-1).
The first match could be found using __builtin_ctzll(m) (rbit/clz).
All matches could be enumerated by clearing the redundant bits and then stepping with m &= m - 1, as sketched after the code below.
See Danila Kutenin's blog post Porting x86 vector bitmask optimizations to Arm NEON.
#include <arm_neon.h>

// find first match
uint8x16_t input = vld1q_u8(ptr);  // 16 bytes to search
uint8x16_t byte_mask = vceqq_u8(input, vdupq_n_u8('A'));
uint8x8_t nibble_mask = vshrn_n_u16(vreinterpretq_u16_u8(byte_mask), 4);
uint64_t m = vget_lane_u64(vreinterpret_u64_u8(nibble_mask), 0);
if (m != 0) return __builtin_ctzll(m) >> 2;  // each match occupies 4 bits
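For example, all matches could be enumerated like so, assuming m was produced as above; handle_match is a hypothetical callback:
// each matching byte sets all four bits of its nibble, so keep one bit per nibble
uint64_t bits = m & 0x1111111111111111ull;
while (bits) {
    int idx = __builtin_ctzll(bits) >> 2;  // bit index / 4 == byte index
    handle_match(idx);                     // hypothetical callback
    bits &= bits - 1;                      // clear the lowest set bit
}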
If an actual bitmap is needed:
Scalar multiplication could be used to extract the bits from each 64-bit word of the comparison mask.
See Arseny Kapoulkine's blog post VPEXPANDB on NEON with Z3.
Similar to Extracting bits with a single multiplication, except optimized for bytes being either 0xFF or 0x00.
uint64_t NEON_i8x16_MatchMask (const uint8_t* ptr, uint8_t match_byte) {
    uint8x16_t cmpMask = vceqq_u8(vld1q_u8(ptr), vdupq_n_u8(match_byte));
    // extract each 64-bit lane into a 64-bit general purpose register
    uint64_t hi = vgetq_lane_u64(vreinterpretq_u64_u8(cmpMask), 1);
    uint64_t lo = vgetq_lane_u64(vreinterpretq_u64_u8(cmpMask), 0);
    // extract bits: the multiply funnels one bit per 0x00/0xFF byte
    // into the top byte of the product
    const uint64_t magic = 0x000103070f1f3f80ull;
    hi = (hi * magic) >> 56;
    lo = (lo * magic) >> 56;
    return (hi << 8) + lo;
}
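A short usage sketch, assuming buf points to at least 16 readable bytes; here each set bit of the returned mask maps directly to a byte index, so no divide-by-4 is needed:
uint64_t mask = NEON_i8x16_MatchMask(buf, 'A');
while (mask) {
    int idx = __builtin_ctzll(mask);  // bit index == byte index
    // ... use idx ...
    mask &= mask - 1;                 // clear the lowest set bit
}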
Processing 64+ bytes in bulk may allow for a more efficient implementation of the 'movemask' operation.
See Geoff Langdale's blog post Fitting My Head Through The ARM Holes.
Similar to the Stack Overflow question 'ARM NEON: Convert a binary 8-bit-per-pixel image (only 0/1) to 1-bit-per-pixel?'.
uint64_t NEON_i8x64_MatchMask (const uint8_t* ptr, uint8_t match_byte)
{
    // 64-byte de-interleaving load (slow on some cores)
    uint8x16x4_t v = vld4q_u8(ptr);
    // detect matches
    uint8x16_t tag = vdupq_n_u8(match_byte);
    uint8x16_t v0 = vceqq_u8(v.val[0], tag);
    uint8x16_t v1 = vceqq_u8(v.val[1], tag);
    uint8x16_t v2 = vceqq_u8(v.val[2], tag);
    uint8x16_t v3 = vceqq_u8(v.val[3], tag);
    // collect the MSBs of 4 bytes vertically into the hi-nibble of each result byte
    uint8x16_t acc = vsriq_n_u8(vsriq_n_u8(v3, v2, 1), vsriq_n_u8(v1, v0, 1), 2);
    // pack the hi-nibbles together
    uint8x8_t r = vshrn_n_u16(vreinterpretq_u16_u8(vsriq_n_u8(acc, acc, 4)), 4);
    return vget_lane_u64(vreinterpret_u64_u8(r), 0);
}
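A minimal sketch of scanning a larger buffer with the 64-byte version, assuming <stddef.h> is included and len is a multiple of 64 (a real implementation would also need a scalar tail loop); find_first is a hypothetical helper:
ptrdiff_t find_first (const uint8_t* buf, size_t len, uint8_t match_byte) {
    for (size_t i = 0; i < len; i += 64) {
        uint64_t mask = NEON_i8x64_MatchMask(buf + i, match_byte);
        if (mask != 0)
            return (ptrdiff_t)(i + __builtin_ctzll(mask));  // index of first match
    }
    return -1;  // no match found
}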