6

Bit counting can be done in several ways, e.g. with a set bit iterator, an unset bit iterator, pre-computed bits with lookup tables, or parallel counting. As I have figured out by searching the web, an unset bit iterator is fast when there are few unset bits, and a set bit iterator the opposite. But when should you use parallel counting, MIT HAKMEM (seen below) in particular? It seems quite fast, although probably slower than lookup tables. Is it always better than the set/unset bit iterators in terms of speed? Are there concerns other than speed and memory when choosing between them?

int BitCount(unsigned int u) {
    unsigned int uCount;

    // Each octal digit (3-bit field) of uCount ends up holding the number of
    // set bits in that field of u: n - n/2 - n/4 for each 3-bit value n
    uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
    // Add adjacent 3-bit counts into 6-bit fields, then sum the fields with
    // the modulus (64 % 63 == 1, so % 63 adds the base-64 digits together)
    return ((uCount + (uCount >> 3)) & 030707070707) % 63;
}
pigelin
  • 143
  • 1
  • 8
  • You might also want to check out [Bit Twiddling Hacks](http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel) for a few other methods – Hasturkun Dec 21 '11 at 13:29
  • @Hasturkun. Thanks, some new methods I didn't know of. – pigelin Dec 21 '11 at 13:38
  • Here is another site specializing in 64-bit bitmaps. http://chessprogramming.wikispaces.com/Population+Count has a lot of different methods. What is best depends on things like the average number of expected bits set. – Bo Persson Dec 21 '11 at 16:36
  • one method you forgot is the `POPCNT` SSE4a instruction: http://msdn.microsoft.com/en-us/library/bb385231.aspx (now if only this was standardized...) – Necrolis Jan 10 '12 at 06:40

4 Answers

5

Why choose one bit-counting method over another? Well, it really depends on your machine and the problem you're trying to solve. Note that all the instruction counts I give below are for a basic RISC processor and might not translate well to a more complicated beast like x86.

The HAKMEM algorithm you quoted will execute in 13 instructions but is unlikely to be very fast due to the modulus operator. Eyeballing it, it does look like it has some pretty good instruction-level parallelism, which should help if your processor is capable of exploiting that.

The algorithm Bo Persson presented is quite fast (2 + 5*pop(x) instructions), but only if the word is sparsely populated. It can also be modified to work on densely populated words. It contains branches and doesn't have any significant instruction-level parallelism.
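For illustration, here is one possible dense-word variant (a sketch of my own, not taken from Bo Persson's answer): it iterates over the clear bits instead of the set bits, setting the lowest zero bit on each pass, and assumes a 32-bit unsigned word.

int popCountDense(unsigned x) {    // hypothetical name, for illustration only
    int count = 32;                // assuming a 32-bit word
    while (x != ~0u) {             // loop once per clear bit
        count--;
        x |= x + 1;                // set the lowest clear bit
    }
    return count;                  // 32 minus the number of clear bits
}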

EDIT: The table lookup method can also be very fast but does make memory accesses. If the entire table is in the L1 cache then it's probably one of the fastest algorithms. If the table isn't in cache then it's almost certainly one of the slowest.
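For reference, a minimal sketch of that approach (my own illustration, not from the original question or answers; the names popTable, initPopTable and popLookup are made up), using a 256-entry byte table built once at startup:

static unsigned char popTable[256];   // popTable[i] == number of set bits in i

void initPopTable(void) {
    for (int i = 1; i < 256; i++)
        popTable[i] = (unsigned char)((i & 1) + popTable[i >> 1]);
}

int popLookup(unsigned x) {
    // One table load per byte of the 32-bit word
    return popTable[x & 0xFF] + popTable[(x >> 8) & 0xFF]
         + popTable[(x >> 16) & 0xFF] + popTable[(x >> 24) & 0xFF];
}

Whether this beats the branch-free versions depends entirely on whether the table stays resident in L1, as noted above.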

The algorithm below is a variation of the HAKMEM algorithm and is presented in the book Hacker's Delight (I highly recommend this book if you like this sort of thing). It executes in 19 instructions and is branch-free. It doesn't use a division but does have a multiplication. It's also very economical with registers, re-using the same mask as much as possible. Still no significant instruction-level parallelism here that I can see.

int pop(unsigned x) {
  unsigned n;

  // After the three subtract steps, each 4-bit field of x holds its own
  // bit count: n - n/2 - n/4 - n/8 for each 4-bit value n
  n = (x >> 1) & 0x77777777;
  x = x - n;
  n = (n >> 1) & 0x77777777;
  x = x - n;
  n = (n >> 1) & 0x77777777;
  x = x - n;
  // Add adjacent 4-bit counts into bytes
  x = (x + (x >> 4)) & 0x0F0F0F0F;
  // The multiply sums all four byte counts into the top byte
  x = x * 0x01010101;
  return x >> 24;
}

The Hacker's Delight book also presents a couple of even more specialised algorithms using 9-, 8-, and 7-bit-wide fields or floating-point operators. Note that most of the analysis I presented above was also taken from that book.

The fact is that there's a truckload of methods, and the only way to be sure which works best in your particular situation is to measure and compare. I do realise that this is a pretty canned answer, but the alternative is to know your processor and compiler inside out.

Ze Blob
  • 2,817
  • 22
  • 16
  • My preferred `popCount` *is* used for a chess program, where we have 64-bit bitmaps representing the pieces on the board. When counting things like number of checking pieces or number of dark pawns, the routine is very good on a 64-bit machine. – Bo Persson Jan 11 '12 at 10:24
4

This is very simple, but amazingly fast if there are just a few bits set:

int popCount (U64 x) {   // U64: an unsigned 64-bit integer type, e.g. uint64_t
   int count = 0;
   while (x) {
       count++;
       x &= x - 1; // reset LS1B
   }
   return count;
}

From Chess programming wiki: Brian Kernighan's way

Bo Persson
  • 90,663
  • 31
  • 146
  • 203
  • Very nice. But unfortunately the loop count is variable. Unless the branch predictor manages to guess the population count correctly, you will get a very expensive pipeline flush. HAKMEM should be faster on a deeply pipelined machine, because it has no conditional branches. – Mackie Messer Jan 11 '12 at 01:13
  • @Mackie - when used in a chess program, `x` most often has *very* few bits set. In that case an inlined version of this function is smaller and more efficient than the HAKMEM version. It has been measured! :-) – Bo Persson Jan 11 '12 at 06:42
  • But who measured it and on which machine? Did you test it on your PC? I just did and on mine HAKMEM is _always_ faster. Also for sparsely populated words. YMMV. – Mackie Messer Jan 11 '12 at 16:09
  • I *love* this approach. Simple, straight-forward *and* fast. – Dr.Kameleon Dec 14 '12 at 06:07
1

If you have an x86 CPU which supports SSE4, you already have a single instruction to do all the work: POPCNT. See the Intel® SSE4 Programming Reference (PDF, 698KB).
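A minimal sketch of using it from C (an illustration assuming your compiler exposes the SSE4.2 intrinsic _mm_popcnt_u32 and is given the appropriate flag, e.g. -msse4.2 for GCC/Clang; the wrapper name is hypothetical):

#include <nmmintrin.h>   // SSE4.2 intrinsics, including _mm_popcnt_u32

int popCountHW(unsigned int x) {   // hypothetical wrapper name
    // Compiles down to a single POPCNT instruction on supporting CPUs
    return _mm_popcnt_u32(x);
}

GCC and Clang also provide __builtin_popcount, which emits POPCNT when the target supports it and falls back to a software routine otherwise.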

Another remark: parallel algorithms like HAKMEM 169 do not contain conditional branches. This is a huge advantage. Modern processors predict the direction of conditional branches but have trouble with random branching patterns, and a misprediction is very costly (it can cost more than the equivalent of 100 instructions). It is best to avoid loops with a variable trip count and/or conditional statements if performance is important.

I also second the recommendation of the book Hacker's Delight.

Mackie Messer
  • 7,118
  • 3
  • 35
  • 40
0

How about something like the following:

  1. From each raw word r, compute a pair-munged word as `w = r - ((r >> 1) & 0x55555555)`. Each pair of bits will hold the number of set bits in that pair (0-2). (Steps 1-3 are sketched in code below.)
  2. Next, from each pair-munged word, compute a quad-munged word `q = (w & 0x33333333) + ((w >> 2) & 0x33333333)`. Each quartet of bits will hold the number of set bits in that quartet of the original word (0-4).
  3. Sum together the quad-munged words in groups of two, and from each sum `s` compute an octet-munged sum `o = (s & 0x0F0F0F0F) + ((s >> 4) & 0x0F0F0F0F)`. Each octet will hold the total number of set bits in the corresponding octet in both input words (0-16).
  4. Sum together octet-munged sums in groups of eight, and from each sum `s` compute a halfword-munged sum `h = (s & 0x00FF00FF) + ((s >> 8) & 0x00FF00FF)`. Each halfword will contain a total number of set bits in the corresponding halfword of all 16 input words (0-256).
  5. Sum together the halfword-munged sums in groups of 64, and from each sum `s` compute a total sum `t = (s & 0x0000FFFF) + ((s >> 16) & 0xFFFF)`. This will hold the number of set bits in 1024 input words (0-32768).

Note that two of the steps are performed for every word, one for every other word, one for every sixteen, and one for every 1024. If words are 64 bits instead of 32, an extra step would be necessary for the final sum, but the word-munged sums could be added in groups of 65,536 (representing 67,108,864 input words).
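As an illustration of the per-word steps (1-3), here is a sketch under the assumptions above; the helper name octetCounts is mine, not part of the answer:

// Steps 1-3 applied to a pair of 32-bit input words a and b
unsigned octetCounts(unsigned a, unsigned b) {
    // Step 1: per-pair counts (0-2 in each 2-bit field)
    unsigned wa = a - ((a >> 1) & 0x55555555);
    unsigned wb = b - ((b >> 1) & 0x55555555);
    // Step 2: per-quartet counts (0-4 in each 4-bit field)
    unsigned qa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333);
    unsigned qb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    // Step 3: add the two quad-munged words, then fold nibbles into bytes (0-16 each)
    unsigned s = qa + qb;
    return (s & 0x0F0F0F0F) + ((s >> 4) & 0x0F0F0F0F);
}

Steps 4 and 5 then keep accumulating these byte counts over larger and larger groups exactly as described in the list.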

supercat
  • 77,689
  • 9
  • 166
  • 211