7

I have needed to implement a string searching algorithm that finds a pattern of bits in a text of bits (the match may not be byte/word aligned). For starters, I implemented the Boyer-Moore algorithm, but comparing individual bits was too slow for my purposes. So instead I tried implementing a blocked based version that would compare entire bytes/words as described in this paper, but it has become complex and unmanageable (in part due to me not entirely understanding what I'm doing.)

Does anyone have good implementation of such an algorithm?

My specific use case is with pattern length N >= 32, text window 2N, and bits packed into ints. Also N in this case is a multiple of char size N % 8 == 0. I preprocess once and use many times on changing text, like the Boyer-Moore. First match is all I need. Performance is key.

Edit: After successfully implementing the Blocked Boyer-Moore algorithm, I notice no improvement(my bit by bit version is faster!) It's probably a mistake of my own, because I've been racking my brain on it and optimized it to the point where it makes no sense without many lines of comments, yet it's still slower. Here it is.

scientiaesthete
  • 919
  • 9
  • 20
  • What is a typical value of `N`? Seeing that you specify `N >= 32`, i guess that the case `N = 32` is important for you; would it be appropriate to assume `N < 100`? – anatolyg Aug 27 '12 at 15:40
  • relevant: [Fastest way to scan for bit pattern in a stream of bits](http://stackoverflow.com/questions/1572290/fastest-way-to-scan-for-bit-pattern-in-a-stream-of-bits), – sehe Aug 27 '12 at 15:41
  • @anatolyg Upper bound is more like `N < 16384`. In case it helps, `N % 8 == 0`. – scientiaesthete Aug 27 '12 at 17:22
  • That simplifies the implementation, yes. – nneonneo Aug 27 '12 at 17:57
  • Few questions: (i) Do you want to reuse the pattern for many texts? (ii) What kind of memory limits do you have, or are you interested on only speed? (iii) Do you care about multiple matches, overlapping matches or worst-case performance? – nneonneo Aug 27 '12 at 18:49
  • @nneonneo (i) Preprocess once, reuse many times. (ii) Only so far as it doesn't cause excessive cache misses and effect performance. (iii) First match is enough. – scientiaesthete Aug 27 '12 at 19:36

3 Answers3

3

If N is big (bigger than, say, 16 bits) then it would be pretty easy to do a preliminary search against 8 shifted copies of the bit pattern (truncating the pattern to eliminate 'partial bytes'). Then you can refine the results by looking at adjacent bits. The byte search (against the 8 shifted copies) could be done using Boyer-Moore or a similarly efficient algorithm.

In case you are wondering: 8 byte searches is probably faster than one bit search since each byte comparison takes just one instruction, whereas the bit manipulation needed to do the bit search takes far more instructions per bit. However, you should always profile to be sure.

nneonneo
  • 171,345
  • 36
  • 312
  • 383
  • The thing is, if you are e.g. looking for 6 adjacent bits in a stream of octets, you would have 3 different positions, but 3*4 = 12 possible bytes that match (since the other 2 bits can assume any value). Worse so if you allow the pattern to overlap octets: the first octet could be any of 256 (gasp) possible values: https://docs.google.com/spreadsheet/ccc?key=0AlX1n8WSRNJWdGIwLU9CNEt2eS14cW1BbWsxeUNjT2c – sehe Aug 27 '12 at 15:10
  • That spreadsheet isn't publicly available. I'm not too certain about your logic here -- what do you mean by '3 different positions'? Only two consecutive octets can possibly match the pattern since 6 < 8. – nneonneo Aug 27 '12 at 16:38
  • (Fixed the sharing issue, sorry about that) The problem is that to match any 2 octets that might contain this bitsequence, you'll have a lot of possible 2-octect pairs that can match. – sehe Aug 27 '12 at 17:12
  • I see. But, as I wrote in my answer, we only use this technique if N is big. In the question's case, N >= 32, so we can search on the full octets only to filter the result set down. – nneonneo Aug 27 '12 at 17:56
3

Overview

I'm new to the SO community, but I look forward to giving something back.

Interesting problem. I put together an implementation which only does byte-based comparisons (with the aid of pre-computed bit patterns and bit masks) as opposed to performing expensive bit-manipulations on compare. As a result, it should be reasonably fast. It doesn't implement any of the Shift Rules (performance optimizations) discussed for the Boyer-Moore algorithm and thus could be further improved.

Although this implementation does depend on the number of pattern bits % CHAR_BIT == 0--on an 8-bit machine, meets your criteria of N % 8 == 0, the implementation will find non-byte-aligned bit patterns. (It also currently requires 8-bit characters ( CHAR_BIT == 8 ), but in the unlikely event your system is not using 8-bit chars, is easily adaptable by changing the all of the arrays/vectors from uint8_t to char, and adjusting the values they contain to reflect the correct number of bits.)

Given that the search doesn't do any bit-twiddling (besides anding pre-computed byte masks) it should be pretty performant.

Algorithm summary

In a nutshell, the pattern to be searched for is specified, and the implementation shifts it by one bit and records the shifted pattern. It also computes masks for the shifted pattern, as for non-byte-aligned bit patterns, some bits at the beginning and end of the comparison will need to be ignored for proper behavior.

The search is conducted for all the pattern bits in each shift position until a match is found or the end of the data buffer is reached.

//
//  BitStringMatch.cpp
//

#include "stdafx.h"
#include <iostream>
#include <cstdint>
#include <vector>
#include <memory>
#include <cassert>

int _tmain(int argc, _TCHAR* argv[])
{
    //Enter text and pattern data as appropriate for your application.  This implementation assumes pattern bits % CHAR_BIT == 0
    uint8_t text[] = { 0xcc, 0xcc, 0xcc, 0x5f, 0xe0, 0x1f, 0xe0, 0x0c }; //1010 1010, 1010 1010, 1010 1010, 010*1 1111, 1110 0000, 0001 1111, 1110 0000, 000*0 1010 
    uint8_t pattern[] = { 0xff, 0x00, 0xff, 0x00 }; //Set pattern to 1111 1111, 0000 0000, 1111 1111, 0000 0000

    assert( CHAR_BIT == 8 ); //Sanity check
    assert ( sizeof( text ) >= sizeof( pattern ) ); //Sanity check

    std::vector< std::vector< uint8_t > > shiftedPatterns( CHAR_BIT, std::vector< uint8_t >( sizeof( pattern ) + 1, 0 ) );  //+1 to accomodate bit shifting of CHAR_BIT bits.
    std::vector< std::pair< uint8_t, uint8_t > > compareMasks( CHAR_BIT, std::pair< uint8_t, uint8_t >( 0xff, 0x00 ) );

    //Initialize pattern shifting through all bit positions
    for( size_t i = 0; i < sizeof( pattern ); ++i ) //Start by initializing the unshifted pattern
    {
        shiftedPatterns[ 0 ][ i ] = pattern[ i ];
    }

    for( size_t i = 1; i < CHAR_BIT; ++i )  //Initialize the other patterns, shifting the previous vector pattern to the right by 1 bit position
    {
        compareMasks[ i ].first >>= i;  //Set the bits to consider in the first...
        compareMasks[ i ].second = 0xff << ( CHAR_BIT - i ); //and last bytes of the pattern

        bool underflow = false;
        for( size_t j = 0; j < sizeof( pattern ) + 1; ++j )
        {
            bool thisUnderflow = shiftedPatterns[ i - 1 ][ j ] & 0x01 ? true : false; 
            shiftedPatterns[ i ][ j ] = shiftedPatterns[ i - 1][ j ] >> 1;

            if( underflow ) //Previous byte shifted out a 1; shift in a 1
            {
                shiftedPatterns[ i ][ j ] |= 0x80;  //Set MSb to 1
            }

            underflow = thisUnderflow;
        }
    }

    //Search text for pattern
    size_t maxTextPos = sizeof( text ) - sizeof( pattern );
    size_t byte = 0;
    bool match = false;
    for( size_t byte = 0; byte <= maxTextPos && !match; ++byte )
    {
        for( size_t bit = 0; bit < CHAR_BIT && ( byte < maxTextPos || ( byte == maxTextPos && bit < 1 ) ); ++bit )
        {
            //Compare first byte of pattern
            if( ( shiftedPatterns[ bit ][ 0 ] & compareMasks[ bit ].first ) != ( text[ byte ] & compareMasks[ bit ].first ) )
            {
                continue;
            }

            size_t foo = sizeof( pattern );
            //Compare all middle bytes of pattern
            bool matchInProgress = true;
            for( size_t pos = 1; pos < sizeof( pattern ) && matchInProgress; ++pos )
            {
                matchInProgress = shiftedPatterns[ bit ][ pos ] == text[ byte + pos ];
            }
            if( !matchInProgress )
            {
                continue;
            }

            if( bit != 0 )  //If compare failed or we're comparing the unshifted pattern, there's no need to compare final pattern buffer byte
            {
                if( ( shiftedPatterns[ bit ][ sizeof( pattern ) ] & compareMasks[ bit ].second ) != ( text[ byte + sizeof( pattern ) ] & compareMasks[ bit ].second ) )
                {
                    continue;
                };
            }

            //We found a match!
            match = true;   //Abandon search
            std::cout << "Match found!  Pattern begins at byte index " << byte << ", bit position " << CHAR_BIT - bit - 1 << ".\n";
            break;
        }
    }
    //If no match
    if( !match )
    {
        std::cout << "No match found.\n";
    }

    std::cout << "\nPress a key to exit...";
    std::getchar();

    return 0;
}

I hope this is helpful.

Community
  • 1
  • 1
U007D
  • 5,772
  • 2
  • 38
  • 44
  • Your implementation is excellent, but for less memory usage and less initialization time you can change it a little: Our mask is 0xFF in entire array except first and last items, so you can completely ignore the mask array and replace it with two mask byte( for example `first` and `last` ) and only their value will tested for first and last items! but beside that your answer was complete. – BigBoss Sep 01 '12 at 02:24
  • @BigBoss if I'm understanding you correctly, the optimization you are describing is already part of the implementation. The above code only checks masks on the first byte of all patterns, and the last bytes of all patterns except the unshifted pattern. It neither creates nor checks 0xff masks for any intermediate pattern bytes. Did I misunderstand your comment? – U007D Sep 02 '12 at 05:19
0

Performance will be highly dependent on the types of bit patterns. For example if the "key" bit sequence and the "search" bit sequence is highly different, then some of the byte shifting solutions, or even your bit by bit solution will be pretty fast. Because at the vast majority of bit offsets, the first comparison will fail and you can move on to the next one.

If the sequences are highly similar then you need a more sophisticated algorithm. Imagine for example 1 million bits that are all 10101010101010...... except that somewhere in the middle a 1 is flipped to a zero e.g. ...101000101... and you are looking for a 10k bit sequence that ends in "...101000" then byte shift comparison algorithms will do poorly because they will have to compare a ton of bytes - 8 times - before failing the match a half a million times.

So the statistical characteristics of your data are important here. Also if the key string is used multiple times, or you expect multiple matches, then proprocessing the buffer can speed things up.

For example you could transform the buffer once to add up the number of bits in each pair of bytes, and in each byte, and then do that to each key string. You can then scan the two buffers. For the string to potentially match, the number of bits in each byte of the key string must always be smaller than the number of bits in each byte pair, and the number of bits in each byte pair of the key string must always be smaller than the bit count in each byte of the search string.

If your key string is large, then you could mark low and high bit "anchors" and scan for those. For example say I was comparing a 10k key string that had two 0 bytes at offsets x, y, z. Then I could scan my search string for positions the number of bits in single bytes at those offsets match zero. That would be really fast.

Rafael Baptista
  • 11,181
  • 5
  • 39
  • 59