NEON can quickly narrow a 128-bit comparison byte mask to 64 bits, using either 'shift right and narrow' (SHRN) or 'pack using signed saturation' (VQMOVN/SQXTN).
This allows the mask to be extracted to a 64-bit general-purpose register on __aarch64__.
Once extracted, the mask can be checked for all-zeros or all-ones (-1).
The first match could be found using __builtin_ctzll(m) (rbit/clz).
All matches could be enumerated by clearing the redundant bits and then stepping with m &= m - 1, as sketched after the code below.
See Danila Kutenin's blog post Porting x86 vector bitmask optimizations to Arm NEON.
#include <arm_neon.h>

// find first match
uint8x16_t input = vld1q_u8(ptr);  // 16 bytes to search
uint8x16_t byte_mask = vceqq_u8(input, vdupq_n_u8('A'));
uint8x8_t nibble_mask = vshrn_n_u16(vreinterpretq_u16_u8(byte_mask), 4);
uint64_t m = vget_lane_u64(vreinterpret_u64_u8(nibble_mask), 0);
if (m != 0) return __builtin_ctzll(m) >> 2;  // each match occupies 4 bits
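For example, all matches could be enumerated like so, assuming m was produced as above; handle_match is a hypothetical callback:
// each matching byte sets all four bits of its nibble, so keep one bit per nibble
uint64_t bits = m & 0x1111111111111111ull;
while (bits) {
    int idx = __builtin_ctzll(bits) >> 2;  // bit index / 4 == byte index
    handle_match(idx);                     // hypothetical callback
    bits &= bits - 1;                      // clear the lowest set bit
}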
If an actual bitmap is needed:
Scalar multiplication could be used to extract the bits from each 64-bit word of the comparison mask.
See Arseny Kapoulkine's blog post VPEXPANDB on NEON with Z3.
Similar to Extracting bits with a single multiplication, except optimized for bytes being either 0xFF or 0x00.
uint64_t NEON_i8x16_MatchMask (const uint8_t* ptr, uint8_t match_byte) {
    uint8x16_t cmpMask = vceqq_u8(vld1q_u8(ptr), vdupq_n_u8(match_byte));
    // extract each 64-bit lane into a 64-bit general purpose register
    uint64_t hi = vgetq_lane_u64(vreinterpretq_u64_u8(cmpMask), 1);
    uint64_t lo = vgetq_lane_u64(vreinterpretq_u64_u8(cmpMask), 0);
    // extract bits: the multiply funnels one bit per 0x00/0xFF byte
    // into the top byte of the product
    const uint64_t magic = 0x000103070f1f3f80ull;
    hi = (hi * magic) >> 56;
    lo = (lo * magic) >> 56;
    return (hi << 8) + lo;
}
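A short usage sketch, assuming buf points to at least 16 readable bytes; here each set bit of the returned mask maps directly to a byte index, so no divide-by-4 is needed:
uint64_t mask = NEON_i8x16_MatchMask(buf, 'A');
while (mask) {
    int idx = __builtin_ctzll(mask);  // bit index == byte index
    // ... use idx ...
    mask &= mask - 1;                 // clear the lowest set bit
}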
Processing 64+ bytes in bulk may allow for a more efficient implementation of the 'movemask' operation.
See Geoff Langdale's blog post Fitting My Head Through The ARM Holes.
Similar to the Stack Overflow question 'ARM NEON: Convert a binary 8-bit-per-pixel image (only 0/1) to 1-bit-per-pixel?'.
uint64_t NEON_i8x64_MatchMask (const uint8_t* ptr, uint8_t match_byte)
{
    // 64-byte de-interleaving load (slow on some cores)
    uint8x16x4_t v = vld4q_u8(ptr);
    // detect matches
    uint8x16_t tag = vdupq_n_u8(match_byte);
    uint8x16_t v0 = vceqq_u8(v.val[0], tag);
    uint8x16_t v1 = vceqq_u8(v.val[1], tag);
    uint8x16_t v2 = vceqq_u8(v.val[2], tag);
    uint8x16_t v3 = vceqq_u8(v.val[3], tag);
    // collect the MSBs of 4 bytes vertically into the hi-nibble of each result byte
    uint8x16_t acc = vsriq_n_u8(vsriq_n_u8(v3, v2, 1), vsriq_n_u8(v1, v0, 1), 2);
    // pack the hi-nibbles together
    uint8x8_t r = vshrn_n_u16(vreinterpretq_u16_u8(vsriq_n_u8(acc, acc, 4)), 4);
    return vget_lane_u64(vreinterpret_u64_u8(r), 0);
}
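A minimal sketch of scanning a larger buffer with the 64-byte version, assuming <stddef.h> is included and len is a multiple of 64 (a real implementation would also need a scalar tail loop); find_first is a hypothetical helper:
ptrdiff_t find_first (const uint8_t* buf, size_t len, uint8_t match_byte) {
    for (size_t i = 0; i < len; i += 64) {
        uint64_t mask = NEON_i8x64_MatchMask(buf + i, match_byte);
        if (mask != 0)
            return (ptrdiff_t)(i + __builtin_ctzll(mask));  // index of first match
    }
    return -1;  // no match found
}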