39

I need to scan for a 16 bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.

What is the fastest way of achieving this? There are various brute force methods, using tables and/or shifts, but are there any "bit twiddling shortcuts" that can cut down the number of calculations by giving yes/no/maybe "contains the flag" results for each byte or word as it arrives?

C code, intrinsics, x86 machine code would all be interesting.

starblue

13 Answers

27

Using simple brute force is sometimes good.

I would precalculate all shifted values of the word and put them in 16 unsigned ints, so you get an array like this (assuming int is twice as wide as short):

 unsigned short pattern = 1234;
 unsigned int preShifts[16];
 unsigned int masks[16];
 int i;
 for(i=0; i<16; i++)
 {
      preShifts[i] = (unsigned int)pattern << i;  //gets promoted to unsigned int
      masks[i] = 0xffffu << i;
 }

Then, for every unsigned short you get out of the stream, make an unsigned int out of that short and the previous short, and compare that unsigned int against the 16 preshifted patterns (under the corresponding masks). If any of them match, you've got one.

So basically like this:

  int numMatch(unsigned short curWord, unsigned short prevWord)
  {
       int numHits = 0;
       unsigned int combinedWords = ((unsigned int)prevWord << 16) | curWord;

       int i;
       for(i=0; i<16; i++)
       {
             if((combinedWords & masks[i]) == preShifts[i]) numHits++;
       }
       return numHits;
  }

Do note that this can potentially report multiple hits when the pattern is detected more than once in the same bits:

e.g. given 32 bits of 0's, with a pattern of 16 0's to detect, the pattern would be detected 16 times!
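Putting the pieces together, the table setup and the per-word check can be assembled into a self-contained scanner. This is a sketch: the names `initPattern` and `scan_stream` are mine, and the convention that the earlier word lands in the high half of `combinedWords` is an assumption matching the `numMatch` code above.

```c
#include <stddef.h>

static unsigned int preShifts[16];
static unsigned int masks[16];

/* table setup, as in the answer above */
static void initPattern(unsigned short pattern)
{
    for (int i = 0; i < 16; i++) {
        preShifts[i] = (unsigned int)pattern << i;
        masks[i] = 0xffffu << i;
    }
}

/* count pattern occurrences inside one 32-bit window of two stream words */
static int numMatch(unsigned short curWord, unsigned short prevWord)
{
    int numHits = 0;
    unsigned int combinedWords = ((unsigned int)prevWord << 16) | curWord;
    for (int i = 0; i < 16; i++)
        if ((combinedWords & masks[i]) == preShifts[i])
            numHits++;
    return numHits;
}

/* scan a buffer of 16-bit words, counting hits across word boundaries */
static int scan_stream(const unsigned short *words, size_t n, unsigned short pattern)
{
    initPattern(pattern);
    int hits = 0;
    for (size_t i = 1; i < n; i++)
        hits += numMatch(words[i], words[i - 1]);
    return hits;
}
```

For example, with pattern `0xABCD` and words `{0x000A, 0xBCD0}` (the pattern sitting 4 bits up in the combined 32-bit window), `scan_stream` reports one hit.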


The time cost of this, assuming it compiles approximately as written, is 16 checks per input word. Per input bit, it does one `&` and one `==`, plus a branch or other conditional increment, as well as a table lookup of the mask for every bit.

The table lookup is unnecessary: by instead right-shifting `combined`, we get significantly more efficient asm, as shown in another answer, which also shows how to vectorize this with SIMD on x86.

Peter Cordes
Toad
  • 1
This is more along the lines of what I was thinking, and quite possibly as good as it will get, especially if SIMD intrinsics allow multiple comparisons across multiple words. The "flag" is non-zero and unique within the bit stream. –  Oct 15 '09 at 13:55
  • 2
Cool. If you use SIMD you could potentially search even faster by putting the pattern a few times 'next to each other' in the huge registers. This way you can compare something like 4 unsigned int's at the same time. – Toad Oct 15 '09 at 13:59
I think the general idea is workable, but the details need some refinement. A pattern might span up to 3 bytes, so the 'match' table would have to contain items that can hold 24-bit values (probably unsigned ints); you'd need to represent all of the possible values in the 'don't care' bits, so you'd need 16*256=4096 items in your match table. A binary search might be in order. Finally, you have to build the 3-byte value to look up on a byte-by-byte basis, not on a short-by-short basis. On the whole, I think that mouviciel's state machine approach would be simpler and possibly faster. – Michael Burr Oct 15 '09 at 15:02
  • 2
    michael: I don't think you understand the logic completely. The 16bit pattern is shifted 16 times into 16 unsigned ints creating every possible variation of the 16bit word. By getting 16 bits a time and masking this and checking it against the 16 possible patterns you have done all the checks you need. – Toad Oct 15 '09 at 15:05
  • @reinier - I see - you're right. But the point about working byte-by-byte through the input stream stands (I think). Otherwise you'd miss the situation where a pattern spanned a short boundary in the input stream. – Michael Burr Oct 15 '09 at 15:12
  • @michael: No it won't miss it since I'm checking unsigned int's at a time. So it doesn't matter where the pattern of 16bits is shifted inside this 32 bit unsigned int – Toad Oct 15 '09 at 15:20
  • Another slightly tricky thing that needs to be handled (related to getting multiple matches) is to handle the 'remaining' bits in the stream after a match (assuming that you're interested in continuing the search for further matches). I think this just requires that the next matching loop start at the correct offset in the combinedWords/masks arrays (the one after the index where the current match was found, assuming you don't want to consider any of the bits in the current match for the next match). Not rocket science, but something that might not be obvious at first glance. – Michael Burr Oct 15 '09 at 15:20
  • @reinier - damn- I should go back to sleep. – Michael Burr Oct 15 '09 at 15:23
  • `pattern` should be declared `1234U`, but other than that... this is pretty much what I would have done. – caf Oct 15 '09 at 20:38
I see an issue with the above algorithm: a possible case of (false) double counting if there is a pattern match at a short byte boundary (alignment). E.g. if the pattern is xaabb and the stream is x1122aabb3344. The first call to numMatch(xaabb, x1122) will return 1. Second call to numMatch(x3344, xaabb) will return 1. For a total match of 2. – Ahmed A May 14 '14 at 06:05
  • Since we precalc the pattern by shifting 0 bits upto and including 15 bits, I don't see how we can match the pattern which is 16 bits further in the stream as you suggest. – Toad May 14 '14 at 21:26
  • Some platforms have `int` = `short` both being 16 bit. `uint32_t` would be a more portable choice, or even `uint_least32_t` for even more portability (to platforms that don't have a type that's exactly 32 bits). – Peter Cordes Mar 03 '19 at 22:47
  • **There's no need to pre-calc shifted mask and pattern, instead do `combined >>= 1` inside the bit-loop**. And test `(combined & 0xFFFF) == pattern`. Right shifts or bitfield extracts are very cheap, and on architectures without enough registers to keep 16 shifted patterns in 16 different registers, some compilers actually load or compare with memory. https://godbolt.org/z/kKEnWs Compilers unroll the right shift loop to a bitfield-extract on ISAs like ARM (`ubfx`), but on x86 compilers can just do a 16-bit compare of the low 16-bits of a register. (`cmp cx,dx`). – Peter Cordes May 01 '19 at 20:52
  • [Posted an answer](https://stackoverflow.com/questions/1572290/fastest-way-to-scan-for-bit-pattern-in-a-stream-of-bits/56003691#56003691) with a version that does the same work but without arrays, which compiles to more efficient asm on x86 and most other ISAs. And @hplbsh: my answer also includes an AVX2 version that does the brute-force check of all 16 bit offsets in 8 uops + loop overhead. – Peter Cordes May 06 '19 at 10:52
17

Here is a trick to speed up the search by a factor of 32, if neither the Knuth-Morris-Pratt algorithm on the alphabet of two characters {0, 1} nor reinier's idea are fast enough.

You can first use a table with 256 entries to check for each byte in your bit stream whether it is contained in the 16-bit word you are looking for. You get the table with

unsigned char table[256];
for (int i=0; i<256; i++)
  table[i] = 0; // initialize with false
for (int i=0; i<8; i++)
  table[(word >> i) & 0xff] = 1; // mark contained bytes with true

You can then find possible positions for matches in the bit stream using

for (int i=0; i<length; i++) {
  if (table[bitstream[i]]) {
    // here comes the code which checks if there is really a match
  }
}

As at most 8 of the 256 table entries are non-zero, on average you have to take a closer look at only every 32nd position. Only for this byte (combined with the byte before and the byte after) do you then have to use bit operations or some masking techniques, as suggested by reinier, to see if there is a match.

The code assumes that you use little-endian byte order. The order of the bits in a byte can also be an issue (known to everyone who has already implemented a CRC32 checksum).
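To make the filter concrete, here is one hedged sketch of the inner "is there really a match" check. The function name `verify_at` and the little-endian bit convention (stream byte j holds stream bits 8j..8j+7, bit 0 lowest) are my assumptions, not part of the answer:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical verifier for a candidate byte flagged by table[].
 * Convention (an assumption): the stream is little-endian, so byte j
 * holds stream bits [8j, 8j+8).  An occurrence of `word` that fully
 * contains byte i must start 0..8 bits into the 24-bit window built
 * from bytes i-1..i+1. */
static int verify_at(const uint8_t *stream, size_t len, size_t i, uint16_t word)
{
    if (i == 0 || i + 1 >= len)
        return 0;                       /* edge bytes skipped for brevity */

    uint32_t window = (uint32_t)stream[i - 1]
                    | (uint32_t)stream[i]     << 8
                    | (uint32_t)stream[i + 1] << 16;

    for (int off = 0; off <= 8; off++)  /* possible start bits of the word */
        if (((window >> off) & 0xFFFFu) == word)
            return 1;
    return 0;
}
```

For example, with `word = 0xABCD` placed at bit offset 3, the stream bytes are `{0x68, 0x5E, 0x05}`, and `verify_at(stream, 3, 1, 0xABCD)` returns 1.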

Whoever
10

I would like to suggest a solution using 3 lookup tables of size 256. This would be efficient for large bit streams. This solution takes 3 bytes in a sample for comparison. The following figure shows all possible arrangements of 16-bit data in 3 bytes. Each byte region is shown in a different color.

alt text http://img70.imageshack.us/img70/8711/80541519.jpg

Here, arrangements 1 to 8 will be taken care of in the first sample, and 9 to 16 in the next sample, and so on. Now, when we are searching for a Pattern, we will find all 8 possible arrangements (as below) of this Pattern and store them in 3 lookup tables (Left, Middle and Right).

Initializing Lookup Tables:

Let's take 0111011101110111 as an example Pattern to find. Now consider the 4th arrangement. The left part would be XXX01110. Fill all rows of the Left lookup table indexed by the left part (XXX01110) with 00010000. The 1 indicates the starting position of this arrangement of the input Pattern. Thus, the following 8 rows of the Left lookup table would be filled with 16 (00010000):

00001110
00101110
01001110
01101110
10001110
10101110
11001110
11101110

The middle part of the arrangement would be 11101110. The row at this index (238) in the Middle lookup table will be filled with 16 (00010000).

Now, the right part of the arrangement would be 111XXXXX. All rows (32 rows) with index 111XXXXX will be filled with 16 (00010000).

We should not overwrite elements in a lookup table while filling. Instead, do a bitwise OR operation to update an already-filled row. In the above example, all rows written by the 3rd arrangement would be updated by the 7th arrangement as follows.

alt text

Thus, rows with index XX011101 in the Left lookup table, 11101110 in the Middle lookup table, and 111XXXXX in the Right lookup table will be updated to 00100010 by the 7th arrangement.

Searching Pattern:

Take a sample of three bytes. Find Count as follows, where Left is the left lookup table, Middle is the middle lookup table and Right is the right lookup table:

Count = Left[Byte0] & Middle[Byte1] & Right[Byte2];

The number of 1 bits in Count gives the number of matches of the Pattern in the taken sample.

Here is some sample code, which has been tested.

Initializing lookup table:

    for( RightShift = 0; RightShift < 8; RightShift++ )
    {
        LeftShift = 8 - RightShift;

        Starting = 128 >> RightShift;

        Byte = MSB >> RightShift;

        Count = 0xFF >> LeftShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = ( i << LeftShift ) | Byte;

            Left[Index] |= Starting;
        }

        Byte = LSB << LeftShift;

        Count = 0xFF >> RightShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = i | Byte;

            Right[Index] |= Starting;
        }

        Index = ( unsigned char )(( Pattern >> RightShift ) & 0xFF );

        Middle[Index] |= Starting;
    }

Searching Pattern:

Data is stream buffer, Left is left lookup table, Middle is middle lookup table and Right is right lookup table.

for( int Index = 1; Index < ( StreamLength - 1); Index++ )
{
    Count = Left[Data[Index - 1]] & Middle[Data[Index]] & Right[Data[Index + 1]];

    if( Count )
    {
        TotalCount += GetNumberOfOnes( Count );
    }
}

Limitation:

The above loop cannot detect a Pattern placed at the very end of the stream buffer. The following code needs to be added after the loop to overcome this limitation:

Count = Left[Data[StreamLength - 2]] & Middle[Data[StreamLength - 1]] & 128;

if( Count )
{
    TotalCount += GetNumberOfOnes( Count );
}

Advantage:

This algorithm takes only N-1 logical steps to find Patterns in an array of N bytes. The only overhead is filling the lookup tables initially, which is constant in all cases. So this will be very effective for searching huge byte streams.
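Filling in the declarations the snippets above leave implicit, plus a naive `GetNumberOfOnes`, the whole scheme can be assembled into a self-contained sketch. The function names `init_tables` and `count_pattern` are mine; the MSB-first-within-bytes bit convention follows the initialization code, and the tables rely on being zeroed before use:

```c
static unsigned char Left[256], Middle[256], Right[256];  /* zero-initialized */

static int GetNumberOfOnes(unsigned v)   /* naive popcount */
{
    int n = 0;
    for (; v; v >>= 1)
        n += v & 1;
    return n;
}

static void init_tables(unsigned short Pattern)
{
    unsigned char MSB = Pattern >> 8, LSB = Pattern & 0xFF;
    unsigned char Byte, Starting;
    unsigned int RightShift, LeftShift, Count, Index, i;

    for (RightShift = 0; RightShift < 8; RightShift++) {
        LeftShift = 8 - RightShift;
        Starting  = 128 >> RightShift;

        Byte  = MSB >> RightShift;              /* left part of arrangement */
        Count = 0xFF >> LeftShift;
        for (i = 0; i <= Count; i++) {
            Index = (i << LeftShift) | Byte;
            Left[Index] |= Starting;
        }

        Byte  = LSB << LeftShift;               /* right part (char truncation intended) */
        Count = 0xFF >> RightShift;
        for (i = 0; i <= Count; i++) {
            Index = i | Byte;
            Right[Index] |= Starting;
        }

        Index = (Pattern >> RightShift) & 0xFF; /* middle part */
        Middle[Index] |= Starting;
    }
}

static long count_pattern(const unsigned char *Data, long StreamLength)
{
    long TotalCount = 0;
    unsigned Count;

    for (long Index = 1; Index < StreamLength - 1; Index++) {
        Count = Left[Data[Index - 1]] & Middle[Data[Index]] & Right[Data[Index + 1]];
        if (Count)
            TotalCount += GetNumberOfOnes(Count);
    }
    /* byte-aligned occurrence at the very end of the buffer */
    Count = Left[Data[StreamLength - 2]] & Middle[Data[StreamLength - 1]] & 128;
    if (Count)
        TotalCount += GetNumberOfOnes(Count);
    return TotalCount;
}
```

With `init_tables(0xABCD)` and the stream `{0x00, 0xAB, 0xCD, 0x00}`, `count_pattern` reports exactly one match (the byte-aligned occurrence).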

Glorfindel
Vadakkumpadath
  • 2
    great that solutions still keep coming in. By the way, your name sounds awesome. Could be the name of a character in a star wars movie ;^) – Toad Oct 17 '09 at 12:37
  • 2
    Two optimizations I can see. First would be to check middle before looking up left and right, skipping the left and right checks if middle doesn't look good. A second optimization would be to use one 256x32-bit table rather than three 256x8 tables. This would replace three one-byte dereferences with one longword deference (the matching formula would be (byte2_lookup >> 16) & (byte1_lookup >> 8) & (byte0_lookup)). Note that if a lookup value is zero, one may be able to skip a byte provided one is prepared to backtrack upon finding a non-zero lookup value. – supercat Jan 11 '11 at 16:35
  • Is "raw" supposed to be "row"? Terminology seems strange, but the basic idea of lookup tables to see if a byte is part of a match at some offset seems good. We might be able to adapt this for SIMD lookups in 4-bit chunks, e.g. SSSE3 `pshufb` to do 16 lookups in parallel from a vector of bytes, using 4-bit indices. – Peter Cordes May 01 '19 at 19:01
9

My money's on Knuth-Morris-Pratt with an alphabet of two characters.

atomice
  • That seems to be for byte aligned characters in strings? –  Oct 15 '09 at 13:27
  • 4
    The "alphabet of two characters" means that you turn the bytestream into a bitstream. – MSalters Oct 15 '09 at 13:32
  • 2
the word Knuth automatically adds points to the answer. ;^) I think their algorithm has nothing to do with the bit matching. – Toad Oct 15 '09 at 13:44
  • 6
    This performs poorly, as KMP relies on skipping forward. However, when searching for 10..... you can't skip anything. You have a mismatch when you encounter 11......, yet the string could be 110.... so must check the second 1 in the search string against the first 1 of your pattern. It does work well if your pattern is e.g. 11111.....; whenever you encounter a 0 you can skip 5 ahead. KMP works best if the alphabet is large, e.g. Unicode. A two-character alphabet is the worst case for KMP (but more search algorithms suffer from small alphabets) – MSalters Oct 15 '09 at 13:53
  • 5
    This is overkill for his problem because the KMP algorithm works for arbitrarily long words, where his word has a fixed length of 16 bits. KMP is too flexible for this problem and will not be as efficient as simpler approaches. – A. Levy Oct 15 '09 at 17:08
  • Answering the question exactly requires a lot of hard work, so I'll just call this a bet rather than an answer. –  Oct 16 '09 at 01:03
  • The runtime overhead of pulling out one bit at a time would be comparable to the entire workload in a solution like reinier's and you haven't even started the KMP work at that point. On top of that, an alphabet of only two characters and a short, fixed-length search pattern is the worst case for KMP. Remember that there's more to life than big-O analysis. – Alan Oct 16 '09 at 06:43
7

I would implement a state machine with 16 states.

Each state represents the number of received bits that conform to the pattern. If the next received bit conforms to the next bit of the pattern, the machine steps to the next state. If this is not the case, the machine steps back to the first state (or to another state, if the beginning of the pattern can be matched with a smaller number of received bits).

When the machine reaches the last state, this indicates that the pattern has been identified in the bit stream.
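For a fixed 16-bit pattern, the 16-state machine (including the "step back to another state" transitions) can be precomputed as a small KMP-style DFA with two transitions per state. A sketch, under the assumption that bits are consumed MSB-first; the function names are mine:

```c
#include <stdint.h>
#include <stddef.h>

/* Build the 16-state automaton: next[s][b] is the new state after
 * reading bit b with s pattern bits already matched; state 16 = found.
 * Classic KMP DFA construction, binary alphabet. */
static void build_dfa(uint16_t pattern, uint8_t next[16][2])
{
    int pat[16];
    for (int i = 0; i < 16; i++)
        pat[i] = (pattern >> (15 - i)) & 1;     /* MSB-first (assumption) */

    next[0][pat[0]] = 1;
    next[0][1 - pat[0]] = 0;
    for (int x = 0, s = 1; s < 16; s++) {
        next[s][0] = next[x][0];                /* mismatch: copy restart state's moves */
        next[s][1] = next[x][1];
        next[s][pat[s]] = (uint8_t)(s + 1);     /* expected bit: advance */
        x = next[x][pat[s]];                    /* follow restart state */
    }
}

/* Returns the bit offset of the first occurrence, or -1. */
static long dfa_search(const uint8_t *buf, size_t len, uint16_t pattern)
{
    uint8_t next[16][2];
    build_dfa(pattern, next);

    int state = 0;
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--) {          /* MSB-first within each byte */
            state = next[state][(buf[i] >> b) & 1];
            if (state == 16)
                return (long)(i * 8 + (7 - b)) - 15;  /* start bit of the match */
        }
    return -1;
}
```

For example, searching `{0x00, 0xAB, 0xCD}` for `0xABCD` reports a match starting at bit 8; each input bit is processed in constant time with no backtracking into the stream.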

mouviciel
  • This would be as fast as the input bit stream. – mouviciel Oct 15 '09 at 14:07
  • Probably not as fast as the versions that try to match multiple versions of the pattern at once, but plenty fast for any practical application. – Mark Bessey Oct 15 '09 at 14:47
  • 1
This would likely be pretty fast and have the other advantage of being dead simple, so less chance for boundary condition bugs. The only tricky part is handling partial matches - you have to reset the state 'back somewhere in the stream'. The simple method of just backing up the data stream could cause things to be slow for pathological streams (or patterns) where you get a lot of 15-bit matches. It's possible to build into the state machine the right state to restart matching without having to re-check bits, but that would make the state machine trickier to build (I think). – Michael Burr Oct 15 '09 at 15:09
  • This is tricky in the general case. But if the pattern is a constant of the problem, the state reset can be hardcoded. Or you can put the tricky part at initialisation step with a state-machine generation function taking the pattern as argument. – mouviciel Oct 15 '09 at 15:14
  • @ziggystar: can you elaborate please? – mouviciel Jan 06 '11 at 15:55
  • @mouviciel I'm sorry. I didn't read your answer properly. I've deleted my old wrong comment. – ziggystar Jan 07 '11 at 09:32
4

atomice's answer looked good until I considered Luke and MSalters' requests for more information about the particulars.

Turns out the particulars might indicate a quicker approach than KMP. The KMP article links to a variant for the particular case when the search pattern is 'AAAAAA'. For a multiple pattern search, a multi-pattern algorithm might be most suitable.

You can find further introductory discussion here.

Ewan Todd
  • 2
    still... these all relate to text searches and nothing on bitlevel. Turning the bitstream into a bytestream first might be very costly (memory/processortime) and impractical. – Toad Oct 15 '09 at 14:03
Reinier, I'm concentrating on the string search algorithm here. You point out that the bitwise masking ops are not free. For now, I assume that they are comparably expensive for each algorithm. My main point, though, is that the specifics of the application may allow us to beat KMP. – Ewan Todd Oct 15 '09 at 14:14
  • KMP only can test 1 letter at a time. With bits you can test 16 (or with SIMD 64) at a time. This would make KMP or any other letter based algorithm useless. – Toad Oct 15 '09 at 14:17
  • 1
    I don't understand your insistence that KMP is restricted to byte sized ops. I have in mind a bitwise KMP, which shifts right by one bit on mismatch. – Ewan Todd Oct 15 '09 at 14:57
  • 2
    Google "Adapting the Knuth–Morris–Pratt algorithm for pattern matching in Huffman encoded texts" – Ewan Todd Oct 15 '09 at 15:02
3

Seems like a good use for SIMD instructions. SSE2 added a bunch of integer instructions for crunching multiple integers at the same time, but I can't imagine many solutions for this that don't involve a lot of bit shifts since your data isn't going to be aligned. This actually sounds like something an FPGA should be doing.

ajs410
  • 1
    Emulating what an FPGA does in software may be an interesting place to start. –  Oct 15 '09 at 13:51
  • 1
    All the FPGA would do is shift bits and compare, it's just that a circuit doesn't care about byte alignment, and it can shift the next bit in while it's comparing against the current bit. – ajs410 Oct 15 '09 at 14:02
3

What I would do is create 16 prefixes and 16 suffixes. Then for each 16-bit input chunk, determine the longest suffix match. You've got a match if the next chunk has a prefix match of length (16-N).

A suffix match doesn't actually take 16 comparisons. It does take precalculation based upon the pattern word, though. For example, if the pattern word is 1010101010101010, you can first test the last bit of your 16-bit input chunk. If that bit is 0, you only need to test the ...10101010 suffixes. If the last bit is 1, you need to test the ...1010101 suffixes. You've got 8 of each, for a total of 1+8 comparisons. If the pattern word is 1111111111110000, you'd still test the last bit of your input for a suffix match. If that bit is 1, you have to do 12 suffix matches (regex: 1{1,12}), but if it's 0 you have only 4 possible matches (regex: 1111 1111 1111 0{1,4}), again for an average of 9 tests. Add the 16-N prefix match, and you see that you only need 10 checks per 16-bit chunk.

MSalters
  • Nice ideas in here. I'm going to mull over the problem for a little while and perhaps try this and the one from Reiner. –  Oct 15 '09 at 16:30
3

For a general-purpose, non-SIMD algorithm you are unlikely to be able to do much better than something like this:

unsigned int const pattern = pattern to search for
unsigned int accumulator = first three input bytes

do
{
  bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>1) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>2) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>3) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>4) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>5) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>6) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
  if( found ) { /* pattern found */ }
  accumulator >>= 8;

  unsigned int const data = next input byte
  accumulator |= (data<<16);
} while( there is input data left );
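The loop above can be made runnable over a byte buffer like this. It's a sketch under the little-endian assumption that earlier bytes hold the less-significant bits of the accumulator; the name `bitstream_find` and the tail handling are my additions:

```c
#include <stdint.h>
#include <stddef.h>

/* Scan a byte buffer for a 16-bit pattern at any bit offset.
 * Returns the bit offset of the first match, or -1. */
static long bitstream_find(const uint8_t *buf, size_t len, uint16_t pattern)
{
    if (len < 3)
        return -1;                              /* short-buffer handling omitted */

    uint32_t acc = (uint32_t)buf[0]
                 | (uint32_t)buf[1] << 8
                 | (uint32_t)buf[2] << 16;      /* first three input bytes */

    for (size_t i = 0; ; i++) {
        for (int off = 0; off < 8; off++)
            if (((acc >> off) & 0xFFFFu) == pattern)
                return (long)(i * 8 + off);     /* bit offset of the match */

        if (i + 3 >= len)
            break;
        acc >>= 8;
        acc |= (uint32_t)buf[i + 3] << 16;      /* next input byte */
    }
    /* final byte-aligned start position, not covered by the loop above */
    if (((acc >> 8) & 0xFFFFu) == pattern)
        return (long)(len - 2) * 8;
    return -1;
}
```

For example, `0xABCD` placed at bit offset 3 gives the bytes `{0x68, 0x5E, 0x05}`, and `bitstream_find` returns 3 on that buffer.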
moonshadow
3

You can use the fast Fourier transform for extremely large input (large values of n) to find any bit pattern in O(n log n) time: compute the cross-correlation of a bit mask with the input. The cross-correlation of a sequence x and a mask y, of sizes n and n' respectively, is defined by

R(m) = sum_{k=0}^{n'-1} x_{k+m} y_k

Then the occurrences of your bit pattern that match the mask are exactly the positions where R(m) = Y, where Y is the number of ones in your bit mask.

So if you are trying to match for the bit pattern

[0 0 1 0 1 0]

in

[ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]

then you must use the mask

[-1 -1  1 -1  1 -1]

the -1's in the mask guarantee that those places must be 0.

You can implement the cross-correlation using the FFT in O(n log n) time.

I think KMP has O(n + k) runtime, so it beats this out.
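To illustrate the R(m) = Y criterion (not the FFT speed-up itself), here is a direct O(n·n') cross-correlation in C. The name `count_matches` is mine, and a real implementation would evaluate R via FFTs instead of the inner loop:

```c
/* Count exact occurrences of a 0/1 pattern in a 0/1 sequence using the
 * R(m) == Y criterion: mask y is +1 where the pattern is 1, -1 where it
 * is 0, and Y is the number of ones in the pattern. */
static int count_matches(const int *x, int n, const int *pat, int np)
{
    int y[64];                    /* assumes np <= 64 for this sketch */
    int Y = 0;
    for (int k = 0; k < np; k++) {
        y[k] = pat[k] ? 1 : -1;
        Y += pat[k];
    }

    int matches = 0;
    for (int m = 0; m + np <= n; m++) {
        int R = 0;                /* cross-correlation at offset m */
        for (int k = 0; k < np; k++)
            R += x[m + k] * y[k];
        if (R == Y)               /* every 1 hit, no 0 position hit */
            matches++;
    }
    return matches;
}
```

On the answer's example, the pattern [0 0 1 0 1 0] occurs twice in [1 1 0 0 1 0 1 0 0 0 1 0 1 0 1] (at offsets 2 and 8).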

ldog
3

A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays; it's much simpler to just right-shift `combined >>= 1` inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to uint16_t.)

(Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)

(Correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader.)

// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
        uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
        len /= 2;
        for (size_t i = 0 ; i<len-1 ; i++) {
                //unsigned short curWord = bufshort[i];
                //unsigned short prevWord = bufshort[i+1];
                //int combinedWords = (prevWord<<16) + curWord;

                uint32_t combined;                                // assumes little-endian
                memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

                for(int bitpos=0; bitpos<16; bitpos++) {
                        if( (combined&0xFFFF) == pattern)     // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                                return i*16 + bitpos;
                        combined >>= 1;
                }
        }
        return -1;
}

This compiles significantly more efficiently than loading a mask from an array with recent gcc and clang for most ISAs, like x86, AArch64, and ARM.

Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract or PowerPC rlwinm rotate-left + immediate-mask a bit-range) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.

On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx.

Some compilers for some ISAs manage to do as good a job with @Toad's version as with this, e.g. clang for PowerPC manages to optimize away the array of masks, using rlwinm to mask a 16-bit range of combined using immediates, and it keeps all 16 pre-shifted pattern values in 16 registers, so either way it's just rlwinm / compare / branch whether the rlwinm has a non-zero rotate count or not. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI


AVX2 brute-force

There are (at least) 2 ways to do this:

  • broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially very easy to figure out what position you found a match at. (Maybe less good if you want to count all matches.)
  • vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.

With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries where zeros are shifted in, instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)

Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.

// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15

#ifdef __AVX2__
#include <immintrin.h>
long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i<len-1 ; i++) {
        uint32_t combined; // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);
//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,    3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);  // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi); // 8-bit elements, preserves sign bit
        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;  // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif

This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.

This is tuned for Skylake, on other CPUs, where variable-shifts are less efficient, you might consider just 1 variable shift for offsets 0..7, and then create offsets 8..15 by shifting that. Or something else entirely.

This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory. (Optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd)

Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)

## clang8.0  -O3 -march=skylake  inner loop
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd ymm4, ymm3, ymm1
        vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
        vpcmpeqw        ymm4, ymm4, ymm0
        vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
        vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
        vpmovmskb       ecx, ymm3            # scalar bitmask
        and     ecx, 1431655765              # see if any even elements matched
        jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
        add     rdx, 1
        add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
        cmp     rdx, rsi
        jb      .LBB0_2                    # } while(i<len-1);
        # fall through to not-found.

That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.


Or of course using plain right-shift immediate to shift all elements by 1, and check multiple windows in parallel instead of multiple offsets in the same window.

Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)

Peter Cordes
  • Looks like a great speed up. +1 for actually checking the assembly code. I always hear claims that 'the compiler is smart and will do something clever', but I've seen in the past that that is not always the case. Good to see that the current compilers are actually quite clever. – Toad May 07 '19 at 12:29
  • @Toad: Yeah, compilers usually make code that's not terrible, but the level of faith in compilers that some advocate for is ridiculous. They often need some hand-holding! And yes, clang optimizing away a couple arrays is on the more-clever end of things. But that doesn't happen for all architectures. For some like x86, they auto-vectorize the array init and then use it in a scalar loop. Even ICC didn't manage to auto-vectorize the search itself. (And ICC's scalar loop isn't remotely clever; it's not great except at auto-vectorizing). So the AVX2 version had to be manually vectorized. – Peter Cordes May 07 '19 at 13:05
2

Maybe you should stream your bit stream into a vector (vec_str), stream your pattern into another vector (vec_pattern), and then do something like the algorithm below:

i=0
j=0
while j<vec_str.length
    if (vec_str[j] xor vec_pattern[i])
        j=j-i            // mismatch: restart one bit after this attempt began
        i=0
    else
        i++
        if (i==vec_pattern.length)
            // match found ending at j
            j=j-i+1      // continue the search one bit further on
            i=0
    j++

(hope the algorithm is correct)

Ferenc Deak
using anything like classes on the bit level makes it terribly slow. Bitstreams will be no exception, especially if you are going to compare every bit separately. – Toad Oct 15 '09 at 13:55
  • 1
    I wouldn't bet against this without testing it. With C++ template expansion, this might well get optimized into something surprisingly fast. – Mark Bessey Oct 15 '09 at 14:54
  • 2
    mark: my experience is: always spell it out to the compiler and don't trust its optimization magic too much when going for optimal cycles. The human brain beats compilers anytime. – Toad Oct 16 '09 at 07:29
1

A fast way to find the matches in big bitstrings is to build a lookup table that shows at which bit offsets a given input byte can match the pattern. By combining three consecutive bytes' offset matches you get a bit vector that shows which offsets match the whole pattern. For example, if byte x matches the first 3 bits of the pattern, byte x+1 matches bits 3..10, and byte x+2 matches bits 11..15, then there is a match starting 5 bits into byte x.

Here's some example code that does this, accumulating the results for two bytes at a time:

#include <stdio.h>

void output_match(int pos, unsigned short match);  /* defined below */

void find_matches(unsigned char* sequence, int n_sequence, unsigned short pattern) {
    if (n_sequence < 2)
        return; // 0 and 1 byte bitstring can't match a short

    // Calculate a lookup table that shows for each byte at what bit offsets
    // the pattern could match.
    unsigned int match_offsets[256];
    for (unsigned int in_byte = 0; in_byte < 256; in_byte++) {
        match_offsets[in_byte] = 0xFF;
        for (int bit = 0; bit < 24; bit++) {
            match_offsets[in_byte] <<= 1;
            unsigned int mask = (0xFF0000 >> bit) & 0xFFFF;
            unsigned int match_location = (in_byte << 16) >> bit;
            match_offsets[in_byte] |= !((match_location ^ pattern) & mask);
        }
    }

    // Go through the input 2 bytes at a time, looking up where they match and
    // anding together the matches offsetted by one byte. Each bit offset then
    // shows if the input sequence is consistent with the pattern matching at
    // that position. This is anded together with the large offsets of the next
    // result to get a single match over 3 bytes.
    unsigned int curr, next;
    curr = 0;
    for (int pos = 0; pos < n_sequence-1; pos+=2) {
        next = ((match_offsets[sequence[pos]] << 8) | 0xFF) & match_offsets[sequence[pos+1]];
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(pos, match);
        curr = next;
    }
    // Handle the possible odd byte at the end
    if (n_sequence & 1) {
        next = (match_offsets[sequence[n_sequence-1]] << 8) | 0xFF;
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(n_sequence-1, match);
    }
}

void output_match(int pos, unsigned short match) {
    for (int bit = 15; bit >= 0; bit--) {
        if (match & 1) {
            printf("Bitstring match at byte %d bit %d\n", (pos-2) + bit/8, bit % 8);
        }
        match >>= 1;
    }
}

The main loop of this is 18 instructions long and processes 2 bytes per iteration. If the setup cost isn't an issue, this should be about as fast as it gets.

Ants Aasma
  • 53,288
  • 15
  • 90
  • 97
  • Trying to digest your code (not sure if I get it completely) I see one potential problem: if a byte matches at more than one place in the pattern, then your lookup table won't work, since it can only store one place in the pattern. So given the byte 00000000 and the pattern 1000000000000011, it should give 5 locations, but it can only give 1, resulting in a possible miss of the pattern. Or am I missing something? – Toad Oct 16 '09 at 07:46
  • I'll represent bits in least- to most-significant order. match_offsets[0x00] = 4286594816, that is 00000000 11111100 00000001 11111111; the last 9 bits are not significant, and the 6 set bits in the middle mean that the byte can match at pattern >> 1, pattern >> 2 .. pattern >> 6. (For reference, the least significant bit means the byte matches at pattern << 7, masking out the 7 low bits; the 8th bit, i.e. bit 7, means a match at pattern << 0, i.e. the given byte equals the low byte of the pattern.) – Ants Aasma Oct 16 '09 at 08:52