
I came across this algorithm, submitted here: How does this algorithm to count the number of set bits in a 32-bit integer work?

I ran the algorithm submitted by "yeer" in the link above, as both algorithms looked more or less the same.

I also wrote the traditional (presumably slower) code, to check how much the performance improves.

yeer's code:

unsigned int number = 0xFFFFFFFF;

number = (number & 0x55555555) + ((number>>1)  & 0x55555555);  /* sums of adjacent bits  */
number = (number & 0x33333333) + ((number>>2)  & 0x33333333);  /* sums of 2-bit fields   */
number = (number & 0x0F0F0F0F) + ((number>>4)  & 0x0F0F0F0F);  /* sums of 4-bit fields   */
number = (number & 0x00FF00FF) + ((number>>8)  & 0x00FF00FF);  /* sums of bytes          */
number = (number & 0x0000FFFF) + ((number>>16) & 0x0000FFFF);  /* final 32-bit sum       */

printf("%u", number);

Traditional way:

unsigned int number = 0xFFFFFFFF;
unsigned char i = 0;
unsigned char count = 0;

for (i = 0; i < 32; i++)
{
    if ((number >> i) & 1)
        count++;
}

printf("%d", count);

The second code outperforms yeer's method.

For input value 0xFF (variable declared as unsigned char): traditional = 0.047 s, other = 0.063 s.
For input value 0xFFFFFFFF (variable declared as unsigned int): traditional = 0.141 s, other = 0.141 s.

What's so special about the other algorithm?

I used the Code::Blocks IDE to run both programs.

AlphaGoku
  • `count` is undefined and uninitialised. Please post real code. BTW if `number` (also undefined) is signed, the result/behavior is undefined. – joop Oct 06 '16 at 10:07
  • We can only speculate about your benchmark setup. It's pretty non-trivial to measure the right timings, so a complete [mcve] is unavoidable for your question. – BeyelerStudios Oct 06 '16 at 10:08
  • Try running each one about a million times and see if you get similar results. – dbush Oct 06 '16 at 10:08
  • @dbush I hope it didn't take 47ms to run just once... – BeyelerStudios Oct 06 '16 at 10:09
  • If your "traditional way" is the actual code you tested, the results are meaningless regardless of how many times you run it. My compiler optimizes that code down to `printf("%d", 32);` – Art Oct 06 '16 at 10:44
  • Also, if you're timing that actual code you posted, you're just measuring printf which does a few magnitudes more work than the bit counting. – Art Oct 06 '16 at 10:50
  • (_Never_ put even minutes of thought into performance results from test runs not even taking seconds.) – greybeard Oct 06 '16 at 11:07
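
Following the commenters' suggestions (run each snippet millions of times, and keep the compiler from folding the work into a constant), a minimal timing harness might look like the sketch below. The wrapper names popcount_swar and popcount_loop are illustrative, not from the original post; the volatile sink addresses Art's point about the whole computation being optimized away.

#include <stdio.h>
#include <time.h>

/* yeer's method, wrapped in a function */
static unsigned popcount_swar(unsigned n) {
    n = (n & 0x55555555u) + ((n >> 1)  & 0x55555555u);
    n = (n & 0x33333333u) + ((n >> 2)  & 0x33333333u);
    n = (n & 0x0F0F0F0Fu) + ((n >> 4)  & 0x0F0F0F0Fu);
    n = (n & 0x00FF00FFu) + ((n >> 8)  & 0x00FF00FFu);
    n = (n & 0x0000FFFFu) + ((n >> 16) & 0x0000FFFFu);
    return n;
}

/* the "traditional" loop */
static unsigned popcount_loop(unsigned n) {
    unsigned i, count = 0;
    for (i = 0; i < 32; i++)
        count += (n >> i) & 1;
    return count;
}

int main(void) {
    volatile unsigned sink = 0;  /* prevents the calls from being optimized out */
    unsigned i;
    clock_t t0, t1;

    t0 = clock();
    for (i = 0; i < 100000000u; i++)
        sink = popcount_swar(i);
    t1 = clock();
    printf("SWAR: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (i = 0; i < 100000000u; i++)
        sink = popcount_loop(i);
    t1 = clock();
    printf("loop: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    (void)sink;
    return 0;
}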

2 Answers


I ran a simple benchmark of the different code snippets. Each snippet was executed 100 million times. The code was compiled with gcc with no optimization flags. Each case was executed a couple of times to make sure the result was not unduly distorted by other system activity; the execution times varied by less than 10% between runs.

As the timings below show, the algorithm submitted by yeer is much faster than the other variants.

Test driver:

int main(int argc, char *argv[]) {
    int i, result;
    for (i = 0; i < 100000000; i++) {
        result = SWAR(i);   /* call the snippet under test 100 million times */
    }
    return 0;
}

Code 1, the optimized yeer code:

int SWAR(unsigned int i) {
    i = i - ((i >> 1) & 0x55555555);                             /* per-2-bit counts  */
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);              /* per-4-bit counts  */
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;  /* per-byte counts, summed via multiply */
}

Time: 0.772s
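
The final multiply is what makes Code 1 shorter: after the masking steps, each byte of i holds that byte's bit count, and multiplying by 0x01010101 accumulates all four byte counts into the top byte, which the >> 24 extracts. A minimal illustration (my own addition, not part of the answer):

#include <stdio.h>

int main(void) {
    /* after the masking steps each byte holds a per-byte bit count;
       for 0xFFFFFFFF that is 8 set bits per byte */
    unsigned x = 0x08080808u;
    unsigned total = (x * 0x01010101u) >> 24;  /* top byte = sum of the four bytes */
    printf("%u\n", total);                     /* prints 32 */
    return 0;
}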

Code 2, the un-optimized version:

int SWAR(unsigned int number) {
    number = (number & 0x55555555) + ((number >> 1)  & 0x55555555);  /* per-2-bit counts  */
    number = (number & 0x33333333) + ((number >> 2)  & 0x33333333);  /* per-4-bit counts  */
    number = (number & 0x0F0F0F0F) + ((number >> 4)  & 0x0F0F0F0F);  /* per-byte counts   */
    number = (number & 0x00FF00FF) + ((number >> 8)  & 0x00FF00FF);  /* per-16-bit counts */
    number = (number & 0x0000FFFF) + ((number >> 16) & 0x0000FFFF);  /* full 32-bit count */
    return number;
}

Time: 1.241s

Code 3, bit counter without if:

int SWAR(unsigned int number) {
    int i, count = 0;
    for (i = 0; i < 32; i++) {
        count += (number >> i) & 1;   /* add each bit directly; no branch */
    }
    return count;
}

Time: 8.921s

Code 4, bit counter with if:

int SWAR(unsigned int number) {
    int i, count = 0;
    for (i = 0; i < 32; i++) {
        if ((number >> i) & 1) {      /* branch on each bit */
            count++;
        }
    }
    return count;
}

Time: 21.058s
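
Before trusting the timings, it is worth checking that the fast and slow versions agree. The sanity check below is my own addition, with Code 1 and Code 4 copied under illustrative names so both can coexist in one file:

#include <assert.h>
#include <stdio.h>

/* Code 1 above, renamed for this check */
static int swar_fast(unsigned i) {
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

/* Code 4 above, renamed for this check */
static int bit_loop(unsigned number) {
    int i, count = 0;
    for (i = 0; i < 32; i++)
        if ((number >> i) & 1)
            count++;
    return count;
}

int main(void) {
    unsigned tests[] = { 0x0u, 0x1u, 0xFFu, 0x12345678u, 0xFFFFFFFFu };
    int i;
    for (i = 0; i < (int)(sizeof tests / sizeof tests[0]); i++) {
        assert(swar_fast(tests[i]) == bit_loop(tests[i]));
        printf("0x%08X -> %d bits\n", tests[i], bit_loop(tests[i]));
    }
    return 0;
}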

Klas Lindbäck
  • This improvement can be seen only if 100000000 (many) inputs have to be checked. If the function is called for a single input, there isn't any performance improvement (it's minute). Hence this algorithm should definitely be used if a large number of inputs is to be processed. Thanks. – AlphaGoku Oct 06 '16 at 11:19

The first method has no branching and on most systems compiles to about 15 machine instructions (5 adds, 5 shifts, and 5 ANDs).

Your method typically takes 128 instructions (32 iterations of 4 instructions each), and even with branch prediction most CPUs will have to flush their pipeline at least once when the loop branch is mispredicted, costing 130+ extra CPU cycles to get a result.

I would suggest running both millions of times on random data; then you will see the difference.

Try a for loop from 1 to a million over an array that has been filled with data from rand(), as sketched below.

Your measured times are probably dominated by other things and have nothing to do with your code, which would execute in microseconds.
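
A minimal sketch of that experiment, under the assumption that popcount_loop is the question's branching counter (the name is illustrative); random bit patterns keep the branch predictor from learning the loop's behavior:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

/* branching bit counter from the question, wrapped in a function */
static int popcount_loop(unsigned number) {
    int i, count = 0;
    for (i = 0; i < 32; i++)
        if ((number >> i) & 1)
            count++;
    return count;
}

int main(void) {
    static unsigned data[N];   /* static: avoids a large stack frame */
    volatile int sink = 0;     /* prevents the loop from being optimized away */
    int i;
    clock_t t0, t1;

    srand((unsigned)time(NULL));
    for (i = 0; i < N; i++)
        data[i] = (unsigned)rand();   /* random bit patterns defeat branch prediction */

    t0 = clock();
    for (i = 0; i < N; i++)
        sink = popcount_loop(data[i]);
    t1 = clock();

    printf("loop over %d random values: %.3f s\n", N, (double)(t1 - t0) / CLOCKS_PER_SEC);
    (void)sink;
    return 0;
}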

Secto Kia
  • OP is asking why his method is faster, whereas your answer says that OP's method should be slower – M.M Oct 06 '16 at 10:44
  • 15 machine cycles will be a few nanoseconds on a modern CPU, so approx. 99.99999% of the running time measured will be spent on other stuff unrelated to the code to be benchmarked. – biziclop Oct 06 '16 at 10:55