18

Are there any efficient bitwise operations I can do to get the number of set bits that an integer ends with? For example 1110 = 10112 would be two trailing 1 bits. 810 = 10002 would be 0 trailing 1 bits.

Is there a better algorithm for this than a linear search? I'm implementing a randomized skip list and using random numbers to determine the maximum level of an element when inserting it. I am dealing with 32 bit integers in C++.

Edit: assembler is out of the question, I'm interested in a pure C++ solution.

Lazer
  • 90,700
  • 113
  • 281
  • 364
IVlad
  • 43,099
  • 13
  • 111
  • 179

12 Answers12

11

Calculate ~i & (i + 1) and use the result as a lookup in a table with 32 entries. 1 means zero 1s, 2 means one 1, 4 means two 1s, and so on, except that 0 means 32 1s.

Sparr
  • 7,489
  • 31
  • 48
Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358
  • 1
    This is yet another solution whose correctness I cannot determine at a glance. Could you link to or include working code, please? – Steven Sudit Mar 04 '10 at 16:25
  • This makes sense to me. Basically, by complementing `i` we get the 0 bit that follows the sequence of 1 bits at the end set to 1. By adding 1 to `i`, we get that same bit set to 1. By and-ing the two, we get a power of two, because everything else will be 0, correct? However, even if I keep `lookup[i] = ~i & (i + 1)`, how does that help me get the actual count I'm interested in? for a number `n`? I only get a power of two by using the lookup table, I still need a way to get its logarithm, don't I? So I'm not sure if this is better than the other solutions. – IVlad Mar 04 '10 at 16:36
  • i+1 toggles all trailing ones and the rightmost zero. ~i toggles all bits. & them together and you get a value having just that original rightmost zero as a one, all the rest zeros. The lookup table is for the position of that single one. – Sparr Mar 04 '10 at 16:45
  • and I guess I'll implement this one too... :) – Sparr Mar 04 '10 at 16:46
  • See answer below to get rid of the tables. – phkahler Mar 04 '10 at 16:52
  • Please do :). I'm not sure how you can use a size 32 lookup table and easily get the count. Binary searching as in the OP's implementation kind of reduces performance. – IVlad Mar 04 '10 at 16:53
  • I think he meant something like a hash/assoc instead of a normal array for the lookup table. In PHP thus: array(1->0,2->1,4->2,8->3,etc) – Sparr Mar 04 '10 at 22:14
  • A flat lookup table would take 4 gigabytes, and a binary tree would do just as well with `(i^i+1)` which is simpler. – Potatoswatter Mar 05 '10 at 04:28
  • @sparr, `^` is xor, and use "@" when replying to a comment. – Potatoswatter Mar 05 '10 at 10:10
8

The Bit Twiddling Hacks page has a number of algorithms for counting trailing zeros. Any of them can be adapted by simply inverting your number first, and there are probably clever ways to alter the algorithms in place without doing that as well. On a modern CPU with cheap floating point operations the best is probably thus:

unsigned int v=~input;            // find the number of trailing ones in input
int r;                     // the result goes here
float f = (float)(v & -v); // cast the least significant bit in v to a float
r = (*(uint32_t *)&f >> 23) - 0x7f;
if(r==-127) r=32;
Sparr
  • 7,489
  • 31
  • 48
8

Taking the answer from Ignacio Vazquez-Abrams and completing it with the count rather than a table:

b = ~i & (i+1);   // this gives a 1 to the left of the trailing 1's
b--;              // this gets us just the trailing 1's that need counting
b = (b & 0x55555555) + ((b>>1) & 0x55555555);  // 2 bit sums of 1 bit numbers
b = (b & 0x33333333) + ((b>>2) & 0x33333333);  // 4 bit sums of 2 bit numbers
b = (b & 0x0f0f0f0f) + ((b>>4) & 0x0f0f0f0f);  // 8 bit sums of 4 bit numbers
b = (b & 0x00ff00ff) + ((b>>8) & 0x00ff00ff);  // 16 bit sums of 8 bit numbers
b = (b & 0x0000ffff) + ((b>>16) & 0x0000ffff); // sum of 16 bit numbers

at the end b will contain the count of 1's (the masks, adding and shifting count the 1's). Unless I goofed of course. Test before use.

phkahler
  • 5,687
  • 1
  • 23
  • 31
  • Very nice and no lookup table at all. Thanks! – IVlad Mar 04 '10 at 16:54
  • 1
    However, is this better than the other solutions? Other than using less memory, the 256 ints lookup table solution executes less operations I think. It's hard to test since they are all very fast anyway, but which one do you think would win on a very slow computer? – IVlad Mar 04 '10 at 17:01
  • It should be faster than any solution with a loop or conditional, but you really need to test to know for sure. – phkahler Mar 04 '10 at 17:59
  • The loop for the lookup solution could be unrolled, but I think you're stuck with the conditional. – Steven Sudit Mar 04 '10 at 21:06
  • 2
    faster is very platform dependent. give me an ARM with conditional execution any day. for the uninitiated, almost every arm operation can be conditional. instead of having 16 branch commands (==0, <0, >=0, <=, <, >, >=, etc...) you have 4 bits of the instruction word dedicated to giving those same conditionals to everything. So there is just one branch operation, and you can make it a Branch If Less Than Or Equal with a modifier, but the same modifier can turn the Add operator into in Add If Less Than Or Equal, without any processing overhead. – Sparr Mar 05 '10 at 05:27
  • You can also drop some of the masks since you know certain additions won't overflow. i.e. the last 2 lines are just adding the bytes in b, none of which contain a value over 8. Again - test ;-) – phkahler Mar 05 '10 at 15:17
  • @Sparr: Interesting. I can certainly see how avoiding branches could speed things up. – Steven Sudit Mar 06 '10 at 01:47
6

GCC has __builtin_ctz and other compilers have their own intrinsics. Just protect it with an #ifdef:

#ifdef __GNUC__
int trailingones( uint32_t in ) {
    return ~ in == 0? 32 : __builtin_ctz( ~ in );
}
#else
// portable implementation
#endif

On x86, this builtin will compile to one very fast instruction. Other platforms might be somewhat slower, but most have some kind of bit-counting functionality that will beat what you can do with pure C operators.

Potatoswatter
  • 134,909
  • 25
  • 265
  • 421
2

There may be better answers available, particularly if assembler isn't out of the question, but one viable solution would be to use a lookup table. It would have 256 entries, each returning the number of contiguous trailing 1 bits. Apply it to the lowest byte. If it's 8, apply to the next and keep count.

Steven Sudit
  • 19,391
  • 1
  • 51
  • 53
  • Isn't the worst case of this the same though? If I understand correctly, I'd first check if the current byte has eight trailing bits, then seven, then six, etc. using the lookup table. Or do you mean something else? – IVlad Mar 04 '10 at 16:13
  • Something else! You'd look up the last byte, and if the answer is under 8, you're done. If not, repeat for the next to last byte and keep the total. You do a maximum of 4 lookups for a 32-bit number, with a simple test after each to determine whether to continue. – Steven Sudit Mar 04 '10 at 16:16
  • 1
    The worst case here is 4 lookups instead of 32 bitwise operations. – Sparr Mar 04 '10 at 16:16
  • To be clear, what I'm suggesting is still a linear search, but it's byte-by-byte, not bit-by-bit. It's still O(n), but with a smaller k. – Steven Sudit Mar 04 '10 at 16:17
  • Hmm, and technically, you don't need to keep the total. Your index will tell you how many bytes you've scanned. Multiply those by 8 and add the lookup of the stopping byte. – Steven Sudit Mar 04 '10 at 16:19
  • Some of the other answers are interesting and may be slightly faster or use a smaller table. My only concern is that, at a glance, I have no idea if they're correct. – Steven Sudit Mar 04 '10 at 16:22
  • I get it now, basically I look up every byte in the lookup table. I was thinking about something else initially. This does improve things, I'll consider it, thanks. – IVlad Mar 04 '10 at 16:30
  • @Steven Sudit - It isn't O(n), it is O(1) because you never have to do more than four lookups for any possible input (in this example). If you extended the problem to handle numbers of any size, then it is O(log n) because you are not dealing with the number, but the number of bytes needed to represent the number. – RAL Mar 04 '10 at 16:42
  • @Robert: I meant n as the number of bytes. In this case, n = k, so it's O(1). If we were doing this on an arbitrarily long sequence of bytes, then O(n) would make more sense. – Steven Sudit Mar 04 '10 at 16:58
2

Implementing Steven Sudit's idea...

uint32_t n; // input value
uint8_t o;  // number of trailing one bits in n

uint8_t trailing_ones[256] = {
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 7, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 6, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 
0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 8};

uint8_t t;
do {
  t=trailing_ones[n&255];
  o+=t;
} while(t==8 && (n>>=8))

1 (best) to 4 (worst) (average 1.004) times (1 lookup + 1 comparison + 3 arithmetic operations) minus one arithmetic operation.

Sparr
  • 7,489
  • 31
  • 48
  • That's a pretty fast implementation, and will work in a template for any size int. The one I had in mind would probably have been slower, as it was based on a byte* and a count, which adds flexibility at a cost. If the input type is fixed, the loop could be unrolled for more speed. – Steven Sudit Mar 04 '10 at 21:14
  • Also, in all cases, the final right-shift is unnecessary. Therefore, moving the while into the loop (by changing into into an if/break) would speed it up slightly. Unrolling would be a bigger win, though. I wonder if the unrolling could be done by a template instead of by hand... – Steven Sudit Mar 04 '10 at 21:15
  • 1
    moved the shift, short circuiting should take care of it, AND now I save a loop iteration when the answer is a multiple of 8. – Sparr Mar 04 '10 at 22:25
0

This code counts the number of trailing zero bits, taken from here (there's also a version that depends on the IEEE 32 bit floating point representation, but I wouldn't trust it, and the modulus/division approaches look really slick - also worth a try):

int CountTrailingZeroBits(unsigned int v) // 32 bit
{
    unsigned int c = 32; // c will be the number of zero bits on the right

    static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
    static const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers

    for (int i = 4; i >= 0; --i) // unroll for more speed
    {
        if (v & B[i])
        {
            v <<= S[i];
            c -= S[i];
        }
    }

    if (v)
    {
        c--;
    }

    return c;
}

and then to count trailing ones:

int CountTrailingOneBits(unsigned int v)
{
    return CountTrailingZeroBits(~v);
}
plinth
  • 48,267
  • 11
  • 78
  • 120
0

Blazingly fast ways to find the number of trailing 0's are given in Hacker's Delight.

You could complement your integer (or more generally, word) to find the number of trailing 1's.

Håvard S
  • 23,244
  • 8
  • 61
  • 72
0

http://graphics.stanford.edu/~seander/bithacks.html might give you some inspiration.

Patrick
  • 23,217
  • 12
  • 67
  • 130
0

Implementation based on Ignacio Vazquez-Abrams's answer

uint8_t trailing_ones(uint32_t i) {
 return log2(~i & (i + 1));
}

Implementation of log2() is left as an exercise for the reader (see here)

Sparr
  • 7,489
  • 31
  • 48
0

Taking @phkahler's answer you can define the following preprocessor statement:

#define trailing_ones(x) __builtin_ctz(~x & (x + 1))

As you get a one left to all the prior ones, you can simply count the trailing zeros.

borchero
  • 5,562
  • 8
  • 46
  • 72
-2

I have this sample for you :

#include <stdio.h>

int trailbits ( unsigned int bits, bool zero )
{
    int bitsize = sizeof(int) * 8;
    int len = 0;
    int trail = 0;
    unsigned int compbits = bits;

    if ( zero ) compbits = ~bits;

    for ( ; bitsize; bitsize-- )
    {
        if ( compbits & 0x01 ) trail++;
        else
        {
            if ( trail > 1 ) len++;
            trail = 0;
        }
        compbits = compbits >> 1;
    }

    if ( trail > 1 ) len++;

    return len;
}

void PrintBits ( unsigned int bits )
{
    unsigned int pbit = 0x80000000;
    for ( int len=0 ; len<32; len++ )
    {
        printf ( "%c ", pbit & bits ? '1' : '0' ); 
        pbit = pbit >> 1;
    }
    printf ( "\n" );
}

void main(void)
{
    unsigned int forbyte = 0x0CC00990;

    PrintBits ( forbyte );

    printf ( "Trailing ones is %d\n", trailbits ( forbyte, false ));
    printf ( "Trailing zeros is %d\n", trailbits ( forbyte, true ));
}
lsalamon
  • 7,998
  • 6
  • 50
  • 63
  • I believe the reason you got downvoted is that your solution is essentially the bitwise linear search that the OP already had and was trying to improve upon. – Steven Sudit Mar 04 '10 at 21:17
  • A proof of concept would show who has better performance. Perfection is not the complexity. – lsalamon Mar 05 '10 at 12:11
  • I'm going to take a guess and say that your implementation is slower than Sparr's. – Steven Sudit Mar 05 '10 at 21:53
  • The algorithm is exposed as not working, at least I can not make it work. – lsalamon Mar 06 '10 at 17:37
  • No, the OP didn't say their algorithm didn't work. They just wanted something better than bitwise searching. But all you provided was an example of bitwise searching. – Steven Sudit Mar 08 '10 at 08:19