15

I have a list of N 64-bit integers whose bits represent small sets. Each integer has at most k bits set to 1. Given a bit mask, I would like to find the first element in the list that matches the mask, i.e. element & mask == element.

Example:

If my list is:

index abcdef
  0   001100
  1   001010
  2   001000
  3   000100
  4   000010
  5   000001
  6   010000
  7   100000
  8   000000

and my mask is 111000, the first element matching the mask is at index 2.

Method 1:

Linear search through the entire list. This takes O(N) time and O(1) space.

Method 2:

Precompute a tree of all possible masks, and at each node keep the answer for that mask. This takes O(1) time for the query, but takes O(2^64) space.

Question:

How can I find the first element matching the mask faster than O(N), while still using a reasonable amount of space? I can afford to spend polynomial time in precomputation, because there will be a lot of queries. The key is that k is small. In my application, k <= 5 and N is in the thousands. The mask has many 1s; you can assume that it is drawn uniformly from the space of 64-bit integers.

Update:

Here is an example data set and a simple benchmark program that runs on Linux: http://up.thirld.com/binmask.tar.gz. For large.in, N=3779 and k=3. The first line is N, followed by N unsigned 64-bit ints representing the elements. Compile with make. Run with ./benchmark.e >large.out to create the true output, which you can then diff against. (Masks are generated randomly, but the random seed is fixed.) Then replace the find_first() function with your implementation.

The simple linear search is much faster than I expected. This is because k is small: an element with k set bits matches a uniformly random mask with probability 2^-k, so on average a match turns up within the first 2^k <= 32 elements.

cberzan
  • Would it be possible to sort the list of elements (recording the original indices)? It's *much* easier if they're sorted. Here are a couple places to start in thinking about this issue: http://en.wikipedia.org/wiki/List_of_algorithms#Item_search and http://en.wikipedia.org/wiki/Selection_algorithm (check out the "Using_data_structures_to_select_in_sublinear_time" section). – Joe Bane Feb 12 '12 at 03:20
  • 1
    Sure, we can sort the list and store the original indices. Then the question becomes: Find the element that matches the mask and has the smallest index. I still don't see how to do it with any traditional search algorithm though, because we're looking for elements matching a mask, rather than looking for one particular element. – cberzan Feb 12 '12 at 03:41
  • 3
    Method 2 takes `O(2^64)` space, which is constant but impractical. – Rex Kerr Feb 12 '12 at 07:22
  • To the OP: could you please make available a test dataset? A few hundred items is probably enough. The search masks can be random, I guess. TIA – wildplasser Feb 13 '12 at 12:20
  • @wildplasser see update above. Thanks Rex Kerr; fixed. – cberzan Feb 13 '12 at 18:45
  • When you say "How can I find the first element matching the mask..." do you mean the first element you stumble across that matches the mask or the first element in the original list that matches the mask? (The former is often much easier than the latter.) – Kaganar Feb 15 '12 at 14:50
  • @Kaganar: The first element in the original list that matches the mask. You can think of an element's index in the original list as the priority for that element (the smaller the better). – cberzan Feb 15 '12 at 15:03

4 Answers

3

A suffix tree (on bits) will do the trick, with the original priority at the leaf nodes:

000000 -> 8
     1 -> 5
    10 -> 4
   100 -> 3
  1000 -> 2
    10 -> 1
   100 -> 0
 10000 -> 6
100000 -> 7

where if the bit is set in the mask, you search both arms, and if not, you search only the 0 arm; your answer is the minimum number you encounter at a leaf node.

You can improve this (marginally) by traversing the bits not in order but by maximum discriminability; in your example, note that 3 elements have bit 2 set, so you would create

2:0 0:0 1:0 3:0 4:0 5:0 -> 8
                    5:1 -> 5
                4:1 5:0 -> 4
            3:1 4:0 5:0 -> 3
        1:1 3:0 4:0 5:0 -> 6
    0:1 1:0 3:0 4:0 5:0 -> 7
2:1 0:0 1:0 3:0 4:0 5:0 -> 2
                4:1 5:0 -> 1
            3:1 4:0 5:0 -> 0

In your example mask this doesn't help (since you have to traverse both the bit2==0 and bit2==1 sides since your mask is set in bit 2), but on average it will improve the results (but at a cost of setup and more complex data structure). If some bits are much more likely to be set than others, this could be a huge win. If they're pretty close to random within the element list, then this doesn't help at all.

If you're stuck with essentially random bits set, you should get about (1-5/64)^32 benefit from the suffix tree approach on average (13x speedup), which might be better than the difference in efficiency due to using more complex operations (but don't count on it--bit masks are fast). If you have a nonrandom distribution of bits in your list, then you could do almost arbitrarily well.

Rex Kerr
2

This is a bitwise Kd-tree: a k-d tree in which every dimension is a single bit. It typically needs fewer than 64 node visits per lookup. Currently, the selection of the bit (dimension) to pivot on is random.

#include <limits.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long long Thing;
typedef unsigned long Number;

unsigned thing_ffs(Thing mask);
Thing rand_mask(unsigned bitcnt);

#define WANT_RANDOM 31
#define WANT_BITS 3

#define BITSPERTHING (CHAR_BIT*sizeof(Thing))
#define NONUMBER ((Number)-1)

struct node {
        Thing value;
        Number num;
        Number nul;
        Number one;
        char pivot;
        } *nodes = NULL;
unsigned nodecount=0;
unsigned itercount=0;

struct node * nodes_read( unsigned *sizp, char *filename);
Number *find_ptr_to_insert(Number *ptr, Thing value, Thing mask);

unsigned grab_matches(Number *result, Number num, Thing mask);
void initialise_stuff(void);

int main (int argc, char **argv)
{
Thing mask;
Number num;
unsigned idx;

srand (time(NULL));
nodes = nodes_read( &nodecount, argv[1]);
fprintf( stdout, "Nodecount=%u\n", nodecount );
initialise_stuff();

#if WANT_RANDOM
mask = nodes[nodecount/2].value | nodes[nodecount/3].value ;
#else
mask = 0x38;
#endif

fprintf( stdout, "\n#### Search mask=%llx\n", (unsigned long long) mask );

itercount = 0;
num = NONUMBER;
idx = grab_matches(&num,0, mask);
fprintf( stdout, "Itercount=%u\n", itercount );

fprintf(stdout, "KdTree search  %16llx\n", (unsigned long long) mask );
fprintf(stdout, "Count=%u Result:\n", idx);
idx = num;
if (idx >= nodecount) idx = nodecount-1;
fprintf( stdout, "num=%4u Value=%16llx\n"
        ,(unsigned) nodes[idx].num
        ,(unsigned long long) nodes[idx].value
        );

fprintf( stdout, "\nLinear search  %16llx\n", (unsigned long long) mask );
for (idx = 0; idx < nodecount; idx++) {
        if ((nodes[idx].value & mask) == nodes[idx].value) break;
        }
fprintf(stdout, "Cnt=%u\n", idx);
if (idx >= nodecount) idx = nodecount-1;
fprintf(stdout, "Num=%4u Value=%16llx\n"
        , (unsigned) nodes[idx].num
        , (unsigned long long) nodes[idx].value );

return 0;
}

void initialise_stuff(void)
{
unsigned num;
Number root, *ptr;
root = 0;

for (num=0; num < nodecount; num++) {
        nodes[num].num = num;
        nodes[num].one = NONUMBER;
        nodes[num].nul = NONUMBER;
        nodes[num].pivot = -1;
        }
nodes[num-1].value = 0; /* last node is guaranteed to match anything */

root = 0;
for (num=1; num < nodecount; num++) {
        ptr = find_ptr_to_insert (&root, nodes[num].value, 0ull );
        if (*ptr == NONUMBER) *ptr = num;
        else fprintf(stderr, "Found %u for %u\n"
                , (unsigned)*ptr, (unsigned) num );
        }
}

Thing rand_mask(unsigned bitcnt)
{

Thing value = 0;
unsigned bit, cnt;

for (cnt=0; cnt < bitcnt; cnt++) {
        bit = 54321*rand();
        bit %= BITSPERTHING;
        value |= 1ull << bit;
        }
return value;
}

Number *find_ptr_to_insert(Number *ptr, Thing value, Thing done)
{
Number num=NONUMBER;

while ( *ptr != NONUMBER) {
        Thing wrong;

        num = *ptr;
        wrong = (nodes[num].value ^ value) & ~done;
        if (nodes[num].pivot < 0) { /* This node is terminal */
                /* choose one of the wrong bits for a pivot .
                ** For this bit (nodevalue==1 && searchmask==0 )
                */
                if (!wrong) wrong = ~done ;
                nodes[num].pivot  = thing_ffs( wrong );
                }
        ptr = (wrong & 1ull << nodes[num].pivot) ? &nodes[num].nul : &nodes[num].one;
        /* Once this bit has been tested, it can be masked off. */
        done |= 1ull << nodes[num].pivot ;
        }
return ptr;
}

unsigned grab_matches(Number *result, Number num, Thing mask)
{
Thing wrong;
unsigned count;

for (count=0; num < *result; ) {
        itercount++;
        wrong = nodes[num].value & ~mask;
        if (!wrong) { /* we have a match */
                if (num < *result) { *result = num; count++; }
                /* This is cheap pruning: the break will omit both subtrees from the results.
                ** But because we already have a result, and the subtrees have higher numbers
                ** than our current num, we can ignore them. */
                break;
                }
        if (nodes[num].pivot < 0) { /* This node is terminal */
                break;
                }
        if (mask & 1ull << nodes[num].pivot) {
                /* avoid recursion if there is only one non-empty subtree */
                if (nodes[num].nul >= *result) { num = nodes[num].one; continue; }
                if (nodes[num].one >= *result) { num = nodes[num].nul; continue; }
                count += grab_matches(result, nodes[num].nul, mask);
                count += grab_matches(result, nodes[num].one, mask);
                break;
                }
        mask |= 1ull << nodes[num].pivot;
        num = (wrong & 1ull << nodes[num].pivot) ? nodes[num].nul : nodes[num].one;
        }
return count;
}

unsigned thing_ffs(Thing mask)
{
unsigned bit;

#if 1
if (!mask) return (unsigned)-1;
for ( bit=random() % BITSPERTHING; 1 ; bit += 5, bit %= BITSPERTHING) {
        if (mask & 1ull << bit ) return bit;
        }
#elif 0
for (bit =0; bit < BITSPERTHING; bit++ ) {
        if (mask & 1ull <<bit) return bit;
        }
#else
mask &= -mask; // isolate the lowest set bit
for (bit =0; bit < BITSPERTHING; bit++ ) {
        mask >>=1;
        if (!mask) return bit;
        }
#endif

return 0xffffffff;
}

struct node * nodes_read( unsigned *sizp, char *filename)
{
struct node *ptr;
unsigned size,used;
FILE *fp;

if (!filename) {
        size = (WANT_RANDOM+0) ? WANT_RANDOM : 9;
        ptr = malloc (size * sizeof *ptr);
#if (!WANT_RANDOM)
        ptr[0].value = 0x0c;
        ptr[1].value = 0x0a;
        ptr[2].value = 0x08;
        ptr[3].value = 0x04;
        ptr[4].value = 0x02;
        ptr[5].value = 0x01;
        ptr[6].value = 0x10;
        ptr[7].value = 0x20;
        ptr[8].value = 0x00;
#else
        for (used=0; used < size; used++) {
                ptr[used].value = rand_mask(WANT_BITS);
                }
#endif /* WANT_RANDOM */
        *sizp = size;
        return ptr;
        }

fp = fopen( filename, "r" );
if (!fp) return NULL;
fscanf(fp,"%u\n",  &size );
fprintf(stderr, "Size=%u\n", size);
ptr = malloc (size * sizeof *ptr);
for (used = 0; used < size; used++) {
        fscanf(fp,"%llu\n",  &ptr[used].value );
        }

fclose( fp );
*sizp = used;
return ptr;
}

UPDATE:

I experimented a bit with the pivot-selection, favouring bits with the highest discriminatory value ("information content"). This involves:

  • making a histogram of the usage of bits (can be done while initialising)
  • while building the tree: choosing the one with frequency closest to 1/2 in the remaining subtrees.

The result: the random pivot selection performed better.

wildplasser
  • Could you give some explanations how this algorithm is intended to work? I'm not sure it is possible to guess this just reading the code. Inserting several prints into this code, I noticed that "tree" is in fact a linear chain. And `grab_matches` is never called recursively... – Evgeny Kluev Feb 16 '12 at 17:04
  • It attempts to avoid recursion. Did you use the real test data ? BTW, before posting, I stripped all debugging-fprintf()s ;-) The algorithm is very similar to Elkamina, above. But it chooses its pivot bit from the set of bits that differ between the node and the mask_at_test (that is the XOR & mask bit). In fact a Kd-tree, with one-bit wide dimensions. BTW: did you use the real test data from file, or only the built-ins ? – wildplasser Feb 16 '12 at 17:10
  • The OP posted some test data. Link is in the comments under the OP. EDIT: not in the comments, but **in** the OP. – wildplasser Feb 16 '12 at 17:17
  • Sorry, didn't notice it. Very rarely it recurses 1 or 2 times. – Evgeny Kluev Feb 16 '12 at 17:17
  • In most cases, there is only one child. It only recurses if there are two. (maybe it helps if you rename the {nul,one} members into {prev,next} ;-) – wildplasser Feb 16 '12 at 17:20
1

Construct a binary tree as follows:

  1. Every level corresponds to a bit
  2. If the corresponding bit is set, go right; otherwise, go left

Insert every number from the list into the tree this way.

Now, for searching: if the corresponding bit in the mask is 1, traverse both children. If it is 0, traverse only the left child. Essentially, keep traversing the tree until you hit a leaf node (BTW, 0 is a hit for every mask!).

This tree will have O(N) space requirements.

E.g. the tree for 1 (001), 2 (010) and 5 (101):

         root
        /    \
       0      1
      / \     |
     0   1    0
     |   |    |
     1   0    1
    (1) (2)  (5)
ElKamina
  • is this really better than linear search? what about requirement to find the *first* element in the list? – max taldykin Feb 12 '12 at 07:14
  • @maxtaldykin It is MUCH better than the linear search. I am not sure about the first element requirement though. I thought those numbers were sorted (in which case my algorithm works). – ElKamina Feb 12 '12 at 07:22
  • 2
    @ElKamina, to find match with the smallest index you need to store indexes in the leafs. After finding first match you need to keep traversing tree to find minimal index. It seems not MUCH faster than O(N). – max taldykin Feb 12 '12 at 09:49
1

An approach with precomputed bitmasks. Formally it is still O(N), since the and-mask operations are O(N). The final pass is also O(N), because it needs to find the lowest set bit, but that could be sped up, too.

#include <limits.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

  /* For demonstration purposes.
  ** In reality, this should be an unsigned long long */
typedef unsigned char Thing;

#define BITSPERTHING (CHAR_BIT*sizeof (Thing))
#define COUNTOF(a) (sizeof a / sizeof a[0])

Thing data[] =
/****** index abcdef */
{ 0x0c /* 0   001100 */
, 0x0a /* 1   001010 */
, 0x08 /* 2   001000 */
, 0x04 /* 3   000100 */
, 0x02 /* 4   000010 */
, 0x01 /* 5   000001 */
, 0x10 /* 6   010000 */
, 0x20 /* 7   100000 */
, 0x00 /* 8   000000 */
};

        /* Note: this is for demonstration purposes.
        ** Normally, one should choose a machine wide unsigned int
        ** for bitmask arrays.
        */
struct bitmap {
        char data[ 1+COUNTOF (data)/ CHAR_BIT ];
        } nulmaps [ BITSPERTHING ];

#define BITSET(a,i) (a)[(i) / CHAR_BIT ] |= (1u <<  ((i)%CHAR_BIT) )
#define BITTEST(a,i) ((a)[(i) / CHAR_BIT ] & (1u <<  ((i)%CHAR_BIT) ))

void init_tabs(void);
void map_empty(struct bitmap *dst);
void map_full(struct bitmap *dst);
void map_and2(struct bitmap *dst, struct bitmap *src);

int main (void)
{
Thing mask;
struct bitmap result;
unsigned ibit;

mask = 0x38;
init_tabs();
map_full(&result);

for (ibit = 0; ibit < BITSPERTHING; ibit++) {
        /* bit in mask is 1, so bit at this position is in fact a don't care */
        if (mask & (1u <<ibit))  continue;
        /* bit in mask is 0, so we can only select items with a 0 at this bitpos */
        map_and2(&result, &nulmaps[ibit] );
        }

        /* This is not the fastest way to find the lowest 1 bit */
for (ibit = 0; ibit < COUNTOF (data); ibit++) {
        if (!BITTEST(result.data, ibit) ) continue;
        fprintf(stdout, " %u", ibit);
        }
fprintf( stdout, "\n" );
return 0;
}

void init_tabs(void)
{
unsigned ibit, ithing;

        /* 1 bits in data that dont overlap with 1 bits in the searchmask are showstoppers.
        ** So, for each bitpos, we precompute a bitmask of all *entrynumbers* from data[], that contain 0 in bitpos.
        */
memset(nulmaps, 0 , sizeof nulmaps);
for (ithing=0; ithing < COUNTOF(data); ithing++) {
        for (ibit=0; ibit < BITSPERTHING; ibit++) {
                if ( data[ithing] & (1u << ibit) ) continue;
                BITSET(nulmaps[ibit].data, ithing);
                }
        }
}

        /* Logical AND of two bitmask arrays; similar to dst &= src */
void map_and2(struct bitmap *dst, struct bitmap *src)
{
unsigned idx;
for (idx = 0; idx < COUNTOF(dst->data); idx++) {
        dst->data[idx] &= src->data[idx] ;
        }
}

void map_empty(struct bitmap *dst)
{
memset(dst->data, 0 , sizeof dst->data);
}

void map_full(struct bitmap *dst)
{
unsigned idx;
        /* NOTE this loop sets too many bits to the left of COUNTOF(data) */
for (idx = 0; idx < COUNTOF(dst->data); idx++) {
        dst->data[idx] = ~0;
        }
}
wildplasser
  • Since mask is `drawn uniformly from the space of 64-bit integers`, this method should give 2x speed up on average. – Evgeny Kluev Feb 13 '12 at 09:31
  • The number of bit operations could be further reduced by creating precomputed bitmasks for more than one bit at a time, at the expense of more memory, eg 4bit nibbles: 16*16 bitmasks, each covering 1/16 (4bits) of the 64 bit words. The bitmasks would also become less dense, which might be exploited by choosing a different representation for the sets. It is probablly faster than the naive seq search, but formally still O(N). – wildplasser Feb 13 '12 at 09:51
  • This algorithm retains all good properties of simple linear search. It is simple, linear, predictable, vectorizable and cache-friendly. As for further optimization, 4bit nibbles do not look very promising. I believe, using all possible 2-bit permutations will give another 2x speed-up, taking only 30x more memory (may be still in cache). All 3,4-bit permutations give 4x more speed theoretically, but limited memory bandwidth spoils it. I wonder if it's possible to invent better bit combinations... – Evgeny Kluev Feb 13 '12 at 10:12
  • My idea started with two-bit subnibbles, too. In any case, the resulting bitsets are the same size (N bits), only more sparse. Exploiting the sparsity could lead to some memory and bandwidth savings. The pairwise and-ing could be parallelised, yes. – wildplasser Feb 13 '12 at 10:30
  • One little thing in your program looks like a bug: map_and2. I think, it should be map_or2 instead. And the "sparsity" would decrease after several ors. Data would be not very compressible. By the way, do you plan to add some explanation for your algorithm? I understood it only because I was thinking in the same direction at the moment. It may be not clear enough for unprepared reader. – Evgeny Kluev Feb 13 '12 at 11:07
  • No, it should be an and, IMHO. Maybe you are confused because it works on the negated bitmask? The nulbit array contains 1 bits for the entries that have a 0 bit in the given bitposition. (1 bits outside the searchmask are showstoppers) – wildplasser Feb 13 '12 at 11:16
  • Yes, I didn't notice negated bitmask. Speaking about "sparsity", it would still decrease after several ands. – Evgeny Kluev Feb 13 '12 at 11:26
  • 4bit nibbles optimization is in fact not bad, I was not right. Actually, 8*256 bitmasks are even better: with only 32x more memory, they require only 7 logical operations and give 9x speed improvement. – Evgeny Kluev Feb 13 '12 at 18:54
  • 1
    BTW: I am working on a better solution with K-d trees. Pretty hard, because the mask functions as a wildcard. Still looks promising. – wildplasser Feb 14 '12 at 10:28
  • I also considered using AD-trees from machine learning (see A. Moore, M. S. Lee - "Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets"), but I couldn't figure out how to extend it to searching with a mask. – cberzan Feb 14 '12 at 14:53