A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays; it's much simpler to just right shift with `combined >>= 1` inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to `uint16_t`.)
(Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)
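(To illustrate that claim, here's a minimal sketch, with hypothetical helper names, of the same single-position test done both ways: the mask version needs a shifted mask and a shifted copy of the pattern (or a table of pre-shifted values), while the shift version moves the data once and compares the low 16 bits.)

// illustration only: two ways to test a 16-bit pattern at one bit position
static inline int match_with_mask(uint32_t combined, uint16_t pattern, int bitpos) {
    uint32_t mask = 0xFFFFu << bitpos;                  // build a mask for the bit-range
    return (combined & mask) == ((uint32_t)pattern << bitpos);
}
static inline int match_with_shift(uint32_t combined, uint16_t pattern, int bitpos) {
    return ((combined >> bitpos) & 0xFFFF) == pattern;  // shift out the bits you don't want
}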
(Correctly handling the very last 16-bit chunk of an array of `uint16_t`, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader; one possible sketch follows the function below.)
#include <stdint.h>
#include <string.h>   // for memcpy

// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
    uint16_t *bufshort = (uint16_t*)buf;   // maybe unsafe type punning
    len /= 2;
    for (size_t i = 0 ; i<len-1 ; i++) {
        //unsigned short curWord = bufshort[i];
        //unsigned short prevWord = bufshort[i+1];
        //int combinedWords = (prevWord<<16) + curWord;

        uint32_t combined;                                // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

        for(int bitpos=0; bitpos<16; bitpos++) {
            if( (combined&0xFFFF) == pattern)   // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                return i*16 + bitpos;
            combined >>= 1;
        }
    }
    return -1;
}
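For reference, a hedged, untested sketch (my own helper name) of one way to do that exercise: assemble windows a byte at a time to check the start positions the main loop skips, i.e. those in the last 16-bit word and in a trailing odd byte. It uses the same little-endian bit numbering as the loads above (bit p of the stream is bit p%8 of byte p/8).

// untested sketch: check the start positions the main loop above skips
static long bitstream_search_tail(const uint8_t *buf, size_t len,
                                  unsigned short pattern, size_t first_unchecked_bit)
{
    size_t total_bits = len * 8;
    for (size_t p = first_unchecked_bit; p + 16 <= total_bits; p++) {
        size_t byte = p / 8;
        unsigned bit = p % 8;
        // assemble up to 3 bytes so bits p .. p+15 are fully covered
        uint32_t window = buf[byte] | (uint32_t)buf[byte+1] << 8;
        if (byte + 2 < len)
            window |= (uint32_t)buf[byte+2] << 16;
        if (((window >> bit) & 0xFFFF) == pattern)
            return (long)p;
    }
    return -1;
}

It would be called after the main search returns -1, with `first_unchecked_bit = (len/2 - 1) * 16` (or 0 if `len < 4`), since the loop above only checks start positions 0 through `(len/2 - 1)*16 - 1`.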
With recent gcc and clang, this compiles significantly more efficiently than loading a mask from an array, for most ISAs like x86, AArch64, and ARM.
Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM `ubfx` unsigned bitfield extract, or PowerPC `rlwinm` rotate-left-word-immediate-then-AND-with-mask for a bit-range) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.
On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. `cmp cx, dx` after right-shifting `combined` in `edx`.
Some compilers for some ISAs manage to do as good a job with @Toad's version as with this one, e.g. clang for PowerPC manages to optimize away the array of masks, using `rlwinm` to mask a 16-bit range of `combined` with immediates, and it keeps all 16 pre-shifted pattern values in 16 registers. So either way it's just rlwinm / compare / branch, whether or not the rlwinm has a non-zero rotate count. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI
AVX2 brute-force
There are (at least) 2 ways to do this:
- broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially very easy to figure out at which position you found a match. (Maybe less good if you want to count all matches.)
- vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.
With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we'd still have a break at SIMD element boundaries where zeros are shifted in instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like `VPSHRDW`, a SIMD version of `SHRD`.)
Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a `__m256i`, maybe combining leftovers across multiple vectors.
// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15

#ifdef __AVX2__
#include <immintrin.h>

long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i<len-1 ; i++) {
        uint32_t combined;                                // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);

//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput.
        //   XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.
        //   (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,   3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12, 7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);    // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi);  // 8-bit elements, preserves sign bit

        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;                            // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif
This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.
This is tuned for Skylake. On other CPUs, where variable-count shifts are less efficient, you might consider doing just one variable shift for offsets 0..7 and then creating offsets 8..15 by shifting that result right by another 8 (like the commented-out `vlo` / `vhi` lines in the code above, which would also change the lane ordering for recovering the bit position). Or something else entirely.
This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory, optimizing the `memcpy` unaligned load and the `set1()` into a single `vpbroadcastd`.
Also included on the Godbolt link is a test `main` that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)
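(For reference, a minimal sketch of what such a test harness could look like; my own test values, not necessarily the exact `main` on the Godbolt link. It assumes the two search functions above are in the same file.)

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint8_t buf[16] = {0};
    unsigned short pattern = 0xABCD;

    // plant the pattern at bit offset 37 (bits 37..52, little-endian bit order)
    uint64_t chunk;
    memcpy(&chunk, buf + 4, sizeof(chunk));      // bytes 4..11 = bits 32..95
    chunk |= (uint64_t)pattern << (37 - 32);
    memcpy(buf + 4, &chunk, sizeof(chunk));

    printf("scalar: found at bit %ld (expect 37)\n",
           bitstream_search_rshift(buf, sizeof(buf), pattern));
#ifdef __AVX2__
    printf("avx2:   found at bit %ld (expect 37)\n",
           bitstream_search_avx2(buf, sizeof(buf), pattern));
#endif
    return 0;
}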
## clang8.0 -O3 -march=skylake inner loop
.LBB0_2:                                  # =>This Inner Loop Header: Depth=1
        vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd         ymm4, ymm3, ymm1
        vpsrlvd         ymm3, ymm3, ymm2                # shift 2 ways
        vpcmpeqw        ymm4, ymm4, ymm0
        vpcmpeqw        ymm3, ymm3, ymm0                # compare those results
        vpacksswb       ymm3, ymm4, ymm3                # pack to 8-bit elements
        vpmovmskb       ecx, ymm3                       # scalar bitmask
        and             ecx, 1431655765                 # see if any even elements matched
        jne             .LBB0_4                         # break out of the loop on found, going to a tzcnt / ... epilogue
        add             rdx, 1
        add             r8, 16                          # stupid compiler, calculate this with a multiply on a hit.
        cmp             rdx, rsi
        jb              .LBB0_2                         # } while(i<len-1);
        # fall through to not-found.
That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.
Or of course you could use a plain right-shift by an immediate count to shift all elements by 1, and check multiple windows in parallel instead of multiple offsets in the same window.
Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)
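Here's a hedged, untested sketch of that multiple-windows idea (my own function name; assumes AVX2 and little-endian, and reuses `bitstream_search_rshift` from earlier both to pin down the earliest hit within a block and to handle the tail):

#ifdef __AVX2__
// untested sketch: immediate shifts, 16 overlapping 32-bit windows per outer
// iteration (8 even-start + 8 odd-start), scalar fallback to locate the earliest hit
long bitstream_search_avx2_multiwindow(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);   // pattern in the low word of each dword
    uint16_t *bufshort = (uint16_t*)buf;
    size_t nwords = len / 2;
    size_t i = 0;

    for ( ; i + 17 <= nwords ; i += 16) {        // needs words i .. i+16 in bounds
        __m256i even = _mm256_loadu_si256((const __m256i*)(bufshort + i));     // windows starting at words i, i+2, ..., i+14
        __m256i odd  = _mm256_loadu_si256((const __m256i*)(bufshort + i + 1)); // windows starting at words i+1, i+3, ..., i+15
        unsigned found = 0;
        for (int bitpos = 0; bitpos < 16; bitpos++) {
            // only byte 0 of each dword matters: it carries the sign of the low
            // 16-bit lane, which holds bits bitpos .. bitpos+15 of that window
            found |= _mm256_movemask_epi8(_mm256_cmpeq_epi16(even, vpat)) & 0x11111111;
            found |= _mm256_movemask_epi8(_mm256_cmpeq_epi16(odd,  vpat)) & 0x11111111;
            even = _mm256_srli_epi32(even, 1);   // immediate shift: cheap everywhere
            odd  = _mm256_srli_epi32(odd,  1);
        }
        if (found) {
            // Some window in words i..i+15 matched at some offset, but not necessarily
            // the earliest overall bit position.  Re-scan this small region with the
            // scalar version: same 16 windows, same 16 offsets, checked in order.
            long r = bitstream_search_rshift((uint8_t*)(bufshort + i), 17*2, pattern);
            return i*16 + r;                     // r >= 0 is guaranteed here
        }
    }

    // tail: fewer than 16 windows left; let the scalar loop finish the job
    if (nwords - i >= 2) {
        long r = bitstream_search_rshift((uint8_t*)(bufshort + i), (nwords - i)*2, pattern);
        if (r >= 0)
            return i*16 + r;
    }
    return -1;
}
#endif

The re-scan on a hit is cheap because it only covers the 17 words of the block that triggered it, and it naturally reports the lowest bit position, which takes care of the "check all remaining offsets of all earlier windows" problem within that block; earlier blocks were already ruled out.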