precompute:
- 4bit key offset by 0bit
- 4bit key offset by 1bit
- 4bit key offset by 2bit
- 4bit key offset by 3bit
- 4 bit mask offset by 0bit
- 4 bit mask offset by 1bit
- 4 bit mask offset by 2bit
- 4 bit mask offset by 3bit
for each byte in bytes:
for each offset:
mask out bytes we care about
check if it matches our precomputed key
- Since any item we don't look at could be a match or not, we have to look at all items, so
O(n)
at least.
- Since it's sufficient to look at all items once, there's no reason to use a non-linear algorithm.
- Since it's only 64 values total, highly optimized SIMD instructions are probably not worth the hassle. It's unlikely to improve performance.
The precomputing of the keys and the masks are done by shifting the key left one bit at a time, and by setting the 4 relevant bits to 1 in the masks.
match = (bytes[i] & (0x0F << offset)) == (key << offset)
for some key
and each of the offsets 0-3
. Since (key << offset)
and (0x0F << offset)
will be reused every four loops, precomputing the four values and unrolling the loop will be faster. Your compiler might do this for you, but if not, here's how to do it manually:
matches = 0
for (int i = 0; i < 64; i += 4) {
const mask0 = 0x0F << 0
const key0 = key << 0
match0 = (bytes[i+0] & mask0) == key0
const mask1 = 0x0F << 1
const key1 = key << 1
match1 = (bytes[i+1] & mask1) == key1
...
matches += match0 + match1 + match2 + match3
}