
I have an algorithm to search an array that contains 64 bytes of values, and I want to find a 4-bit value and see how many times it occurs in the array.

For example, the first element in the array might be 10101010 and the 4-bit value 1010; in that case it counts only once. If the next element is 10000000, it counts as 0.

It's easy to go through all the elements in the array and check whether our 4-bit value occurs in each one, but is there a faster way to do this?

  • How does `10101010` contain `1010` only once? It's at least twice, if you require your nibbles "aligned", or 3 times if the top bit of the match can be any of the `1`s except the lowest. You accepted an answer that checks at 4 different bit offsets. Or is it any match within 1-byte chunks that counts, not a bitstream? I guess so; edited the title so future readers can find it. – Peter Cordes Aug 09 '22 at 09:49
  • Or wait, no **the answer counts multiple matches at different offsets within a byte**, but not spanning across bytes. So it doesn't match the question. IDK, going to leave it for now. – Peter Cordes Aug 09 '22 at 09:56

1 Answer

precompute:
- 4-bit key shifted left by 0 bits
- 4-bit key shifted left by 1 bit
- 4-bit key shifted left by 2 bits
- 4-bit key shifted left by 3 bits

- 4-bit mask shifted left by 0 bits
- 4-bit mask shifted left by 1 bit
- 4-bit mask shifted left by 2 bits
- 4-bit mask shifted left by 3 bits

for each byte in bytes:
  for each offset:
    mask out the bits we care about
    check if they match our precomputed key
  • Since any item we don't look at could be a match or not, we have to look at all items, so O(n) at least.
  • Since it's sufficient to look at all items once, there's no reason to use a non-linear algorithm.
  • Since it's only 64 values total, highly optimized SIMD instructions are probably not worth the hassle. It's unlikely to improve performance.

Precomputing the keys and masks is done by shifting the key left one bit at a time, and by setting the 4 relevant bits to 1 in each mask.
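In C, that precompute step might look like the following sketch (the function name `precompute` is my own; `key` is assumed to already fit in the low 4 bits):

```c
#include <stdint.h>

/* Precompute the shifted keys and masks for bit offsets 0..3.
   Assumes key <= 0x0F (a 4-bit value). */
void precompute(uint8_t key, uint8_t keys[4], uint8_t masks[4]) {
    for (int off = 0; off < 4; off++) {
        keys[off]  = (uint8_t)(key  << off);  /* key shifted into position */
        masks[off] = (uint8_t)(0x0F << off);  /* which 4 bits to compare */
    }
}
```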

match = (bytes[i] & (0x0F << offset)) == (key << offset)

for some key and each of the offsets 0-3. Since (key << offset) and (0x0F << offset) are reused on every iteration, precomputing the four values and unrolling the inner loop will be faster. Your compiler might do this for you, but if not, here's how to do it manually:

matches = 0
const mask0 = 0x0F << 0; const key0 = key << 0
const mask1 = 0x0F << 1; const key1 = key << 1
const mask2 = 0x0F << 2; const key2 = key << 2
const mask3 = 0x0F << 3; const key3 = key << 3

for (int i = 0; i < 64; i++) {
    match0 = (bytes[i] & mask0) == key0
    match1 = (bytes[i] & mask1) == key1
    match2 = (bytes[i] & mask2) == key2
    match3 = (bytes[i] & mask3) == key3

    matches += match0 + match1 + match2 + match3
}
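For reference, a complete, runnable C version of that loop might look like this (a sketch; the function name `count_nibble_matches` is my own, and it counts a match at every bit offset within each byte, never across byte boundaries):

```c
#include <stdint.h>
#include <stddef.h>

/* Count occurrences of the 4-bit value `key` (0..15) at bit offsets
   0..3 within each byte of `bytes`. Matches never span two bytes. */
int count_nibble_matches(const uint8_t *bytes, size_t len, uint8_t key) {
    uint8_t keys[4], masks[4];
    for (int off = 0; off < 4; off++) {  /* precompute shifted keys/masks */
        keys[off]  = (uint8_t)(key  << off);
        masks[off] = (uint8_t)(0x0F << off);
    }
    int matches = 0;
    for (size_t i = 0; i < len; i++) {
        /* check all four offsets for this byte */
        matches += (bytes[i] & masks[0]) == keys[0];
        matches += (bytes[i] & masks[1]) == keys[1];
        matches += (bytes[i] & masks[2]) == keys[2];
        matches += (bytes[i] & masks[3]) == keys[3];
    }
    return matches;
}
```

For example, with `key = 0b1010`, the byte `10101010` matches at offsets 0 and 2 and contributes 2 to the count, while `10000000` contributes 0.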
Filip Haglund
  • I'm having a hard time following your answer. Are you saying the effort to optimize this algorithm won't be worth it? – Justin Case Oct 14 '16 at 20:26
  • I'm saying that it's unlikely that we could find a significantly faster algorithm than the one I described above. – Filip Haglund Oct 14 '16 at 20:27
  • Thank you. I'm still confused on the precomputation part. What exactly are you doing there? – Justin Case Oct 14 '16 at 20:29
  • For example, there's a 16-byte-at-a-time SSE compare instruction which could be used to compare 16 bytes against one of the masks. Four masks -> four passes, but possibly a 4x speedup compared to linear checking. The speedup is not going to be visible at this size, though. – Filip Haglund Oct 14 '16 at 20:30
  • Ok that makes sense but how is this faster than checking each 4 bits in each element one at a time? Isn't this essentially doing the same thing? – Justin Case Oct 14 '16 at 20:35
  • 1
    The SSE instructions mentioned above do a compare of 16 bytes at a time, in a single clock cycle. This makes it 4x faster. But the result is delayed a few clock cycles, so it's only useful in a tight loop or some other place where we can do things while we wait. Otherwise we lose time by waiting. – Filip Haglund Oct 14 '16 at 20:41
  • @JustinCase and Filip: SSE2 would be very efficient for this; it would take a tiny bit more work to set up the masks, but on most CPUs they aren't higher latency. And even if they are (like on Bulldozer-family), it's throughput that matters. There's a *lot* of parallelism to be had here, and one instruction can start executing before the previous one finishes. (In fact, 3 SIMD-ALU instructions can start executing in the same clock cycle on current Intel, 4 in Zen). It's also very fast to get the total sum back from SIMD to integer regs at the end. – Peter Cordes Aug 09 '22 at 09:59
  • x86 doesn't have high latencies or pipeline stalls getting data between integer and XMM registers; it is useful in general with only a couple vectors of data, especially when you have single-byte elements. If you're lucky, a C implementation of this will auto-vectorize in a non-terrible way, but manual vectorization with `_mm_cmpeq_epi8` to compare (producing 0 / -1 integer results) and `sum = _mm_sub_epi8(sum, cmp_result)` to accumulate should be very good. (Or `_mm_or_si128` to only count at most one match per byte). – Peter Cordes Aug 09 '22 at 10:02
  • The 8-bit counts can't overflow with 255 vectors or fewer, so you can sum at the end with `_mm_sad_epu8` against `_mm_setzero_si128`. See [Fastest way to horizontally sum SSE unsigned byte vector](https://stackoverflow.com/q/36998538). – Peter Cordes Aug 09 '22 at 10:04
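A manual SSE2 vectorization along the lines sketched in these comments might look like this (my own sketch, not code from the answer; it uses `_mm_cmpeq_epi8` to produce 0/-1 per byte, `_mm_sub_epi8` to accumulate per-byte counts, and `_mm_sad_epu8` for the final horizontal sum; per-lane counts can't exceed 16 here, so the 8-bit accumulators can't overflow):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Count matches of the 4-bit `key` at bit offsets 0..3 within each byte
   of a 64-byte array, using SSE2. A compare result of -1 subtracted from
   the accumulator adds 1 to that byte lane's count. */
int count_nibble_matches_sse2(const uint8_t bytes[64], uint8_t key) {
    __m128i sum = _mm_setzero_si128();
    for (int off = 0; off < 4; off++) {
        __m128i vkey  = _mm_set1_epi8((char)(key  << off));
        __m128i vmask = _mm_set1_epi8((char)(0x0F << off));
        for (int i = 0; i < 64; i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(bytes + i));
            __m128i eq = _mm_cmpeq_epi8(_mm_and_si128(v, vmask), vkey);
            sum = _mm_sub_epi8(sum, eq);  /* 0 - (-1) = +1 per match */
        }
    }
    /* Horizontal sum: SAD against zero yields two 64-bit partial sums. */
    __m128i sad = _mm_sad_epu8(sum, _mm_setzero_si128());
    return _mm_cvtsi128_si32(sad) + _mm_extract_epi16(sad, 4);
}
```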