14

Update: Please read the code; it is NOT about counting the set bits in one integer.

Is it possible to improve performance of the following code with some clever assembler?

uint bit_counter[64];

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

Count() is in the innermost loop of my algorithm.

Update: Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.

The bits variable holds almost random bits (close to half zeros and half ones).

Peter Cordes
Łukasz Lew
  • Ahem, assembly for which architecture? – NPE Oct 17 '11 at 12:52
  • You can, at the very least, make that `Count` function prettier: `for(int i=0; i < 64; ++i) bit_counter[i] += (bits >> i) & 1;`. – Xeo Oct 17 '11 at 13:29
  • How wide do the bit counters have to be? Must they be uints or is narrower allowed? – harold Oct 17 '11 at 13:39
  • 64 bit minimum. I would prefer using 128 bit SSE registers. – Łukasz Lew Oct 17 '11 at 13:50
  • @Xeo: Prettier is much slower though. – Łukasz Lew Oct 17 '11 at 14:47
  • @Lukasz: Are you sure? I'd leave the loop unrolling to the compiler. Since the bounds of the for loop are compile time constants, I'm pretty sure that can and will be optimized by a good compiler. – Xeo Oct 17 '11 at 14:51
  • I don't think that using 128 bit SSE registers is faster than classic 64 bit registers. Check my second version. – GJ. Oct 17 '11 at 19:51
  • @Xeo: Time it. Prettier (in this case) is significantly slower. Łukasz Lew's manual loop unrolling *is* faster -- at least with `gcc 4.2` and `LLVM` – the wolf Oct 18 '11 at 00:32

9 Answers

8

You could try doing it with SSE, incrementing 4 elements per iteration.

Warning: untested code follows...

#include <stdint.h>
#include <emmintrin.h>

uint32_t bit_counter[64] __attribute__ ((aligned(16)));
                     // make sure bit_counter array is 16 byte aligned for SSE

void Count_SSE(uint64_t bits)
{
    const __m128i inc_table[16] = {
        _mm_set_epi32(0, 0, 0, 0),
        _mm_set_epi32(0, 0, 0, 1),
        _mm_set_epi32(0, 0, 1, 0),
        _mm_set_epi32(0, 0, 1, 1),
        _mm_set_epi32(0, 1, 0, 0),
        _mm_set_epi32(0, 1, 0, 1),
        _mm_set_epi32(0, 1, 1, 0),
        _mm_set_epi32(0, 1, 1, 1),
        _mm_set_epi32(1, 0, 0, 0),
        _mm_set_epi32(1, 0, 0, 1),
        _mm_set_epi32(1, 0, 1, 0),
        _mm_set_epi32(1, 0, 1, 1),
        _mm_set_epi32(1, 1, 0, 0),
        _mm_set_epi32(1, 1, 0, 1),
        _mm_set_epi32(1, 1, 1, 0),
        _mm_set_epi32(1, 1, 1, 1)
    };

    for (int i = 0; i < 64; i += 4)
    {
        __m128i vbit_counter = _mm_load_si128((__m128i *)&bit_counter[i]);
                                          // load 4 ints from bit_counter
        int index = (bits >> i) & 15;     // get next 4 bits
        __m128i vinc = inc_table[index];  // look up 4 increments from LUT
        vbit_counter = _mm_add_epi32(vbit_counter, vinc);
                                          // increment 4 elements of bit_counter
        _mm_store_si128((__m128i *)&bit_counter[i], vbit_counter);
                                          // store 4 updated ints
    }
}

How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per loop iteration instead of 1. So we now have 16 loop iterations instead of 64. For each iteration we load 4 bits from bits, then use them as an index into a LUT which contains all possible combinations of 4 increments for the current 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.

The number of loads, stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speed-up though. I'd be interested to know the result if you do decide to try it.

Paul R
  • wow. I really can't see what this does. But it sure looks promising. – sehe Oct 17 '11 at 13:57
  • Can you explain how the code works? – Łukasz Lew Oct 17 '11 at 14:03
  • Sure - I've just added some comments to the code which you may not have seen yet - I'll put some additional explanation shortly. – Paul R Oct 17 '11 at 14:05
  • Not sure about that LUT - [PSHUFB](http://0x80.pl/articles/sse-popcount.html) suggests that you can use an XMM register for a parallel LUT instead, which should be a lot faster than sequential memory lookups. (The linked article uses `PSHUFB` for a regular popcount, but that's obviously not the only possible use of a LUT). – MSalters Oct 17 '11 at 14:36
  • @MSalters: I don't think you can do a PSHUFB lookup table in this case, since the lookup table would need to be different for each element. – Paul R Oct 17 '11 at 15:04
  • I don't think so; the LUT would still be the same: The values 0-15 (16x4 bits=64), but now packed in a 128 bits vector. However, I've realized that the "parallel" lookup might be a bit less parallel than I first anticipated. Needs more thinking, though. – MSalters Oct 17 '11 at 15:30
  • I don't think this works, I'm afraid... –  Oct 18 '11 at 02:48
  • @Colt 45: what doesn't work ? The answer above, or the PSHUFB LUT idea ? – Paul R Oct 18 '11 at 06:06
  • @Paul R: Sorry for the ambiguity. ;-) The answer above produces a different result for `bit_counter` than the OP's code and, regardless, it is slower than the original code -- at least as I compiled it. –  Oct 18 '11 at 16:49
  • @Colt 45: OK - I did say it was untested code. ;-) It's possible that there are bugs, e.g. the values in the lookup table entries may need to be reversed. Also performance will depend on what CPU you are using - ideally you would want to run this on current generation Intel x86-64 CPUs, e.g. Core i7. – Paul R Oct 18 '11 at 16:52
7

Maybe you can do 8 at once, by taking 8 bits spaced 8 apart and keeping 8 uint64s for the counts. That's only 1 byte per single counter though, so you can only accumulate 255 invocations of Count before you'd have to unpack those uint64s.
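
A rough sketch of that idea in plain C (no SSE yet; the `acc`/`pending` bookkeeping and the unpack step are my own illustrative additions, untested):

#include <stdint.h>

uint64_t bit_counter[64];

// Accumulator k holds 8 byte-wide counters, one for each of the bit
// positions k, k+8, ..., k+56.
static uint64_t acc[8];
static unsigned pending;             // Count() calls accumulated so far

void Count(uint64_t bits) {
    for (int k = 0; k < 8; ++k) {
        // Pick out bits k, k+8, ..., k+56 and drop each into its own byte lane.
        uint64_t spread = (bits >> k) & 0x0101010101010101ULL;
        acc[k] += spread;            // 8 independent byte increments, no cross-lane carry
    }
    if (++pending == 255) {          // byte lanes would overflow on the 256th add
        for (int k = 0; k < 8; ++k) {
            for (int j = 0; j < 8; ++j)
                bit_counter[k + 8 * j] += (acc[k] >> (8 * j)) & 0xFF;
            acc[k] = 0;
        }
        pending = 0;
    }
}

Each Count() is then just 8 shift/mask/add triples; the unpack runs only once every 255 calls, so its cost is negligible (as MSalters notes in the comment below).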

harold
  • This is VERY interesting. Best answer so far. – Łukasz Lew Oct 17 '11 at 14:00
  • In theory, you can do the unpacking with quite similar logic. Unpack the `uint64[8]` to an `uint64[16]` every 256*255 iterations. However, you'll quickly notice that this second stage is far less common and therefore adds <1% of an improvement. I.e. the first step is already so efficient that you're hard-pressed to improve it further. This algorithm also lends itself quite well to SSE optimization. – MSalters Oct 17 '11 at 14:11
5

Look at [Bit Twiddling Hacks](https://graphics.stanford.edu/~seander/bithacks.html).

Edit: As for the 'bit position bucket accumulation' (bit_counter[]), I have a feeling that this might be a good case for valarrays + masking. That'd be a fair bit of coding + testing + profiling though. Let me know if you are really interested.

You could, these days, come very close to valarray behaviour using tied tuples (TR1, boost or C++11); I have a feeling it would come out being simpler to read and slower to compile.

sehe
  • +1, I was just about to suggest the same webpage. Using the lookup table you can do it in `O(1)`. – Gaston Oct 17 '11 at 13:34
  • @PaulR: I don't agree. It frames the subject. I share my thoughts on how to apply it to this question (you can use a lookup per byte to store valarray masks, or mimic the same using tuples). A vectorizing compiler (Intel, gcc) will likely optimize the result into proper SIMD instructions. Sadly, I don't have time to work this out right away :) – sehe Oct 17 '11 at 13:53
  • This does not answer my question in any way. Also I don't see valarrays helping here. – Łukasz Lew Oct 17 '11 at 13:54
  • Although the bit twiddling hacks are indeed very useful I don't think they really help with this particular question. – Paul R Oct 17 '11 at 13:54
  • @Gastón: The O(1) is idealized and ignores the effects of lookup table size (cache locality: if the table becomes too big, things will stop scaling - the algorithmic complexity may be O(1) but that doesn't mean the CPU is able to execute it efficiently, because the CPU is not a perfect model) – sehe Oct 17 '11 at 14:35
  • Not responsive to the question... –  Oct 18 '11 at 02:51
4

Apparently this can be done quickly with "vertical counters". From the now-defunct page on Bit tricks (archive) by @steike:

Consider a normal array of integers, where we read the bits horizontally:

       msb<-->lsb
  x[0]  00000010  = 2
  x[1]  00000001  = 1
  x[2]  00000101  = 5

A vertical counter stores the numbers, as the name implies, vertically; that is, a k-bit counter is stored across k words, with a single bit in each word.

  x[0]  00000110   lsb ↑
  x[1]  00000001       |
  x[2]  00000100       |
  x[3]  00000000       |
  x[4]  00000000   msb ↓
             512   (each column is one counter, read from x[0] = lsb downward: 5, 1, 2)

With the numbers stored like this, we can use bitwise operations to increment any subset of them all at once.

We create a bitmap with a 1 bit in the positions corresponding to the counters we want to increment, and loop through the array from LSB up, updating the bits as we go. The "carries" from one addition become the input for the next element of the array.

  input   sum
  A  B    C  S
  0  0    0  0
  0  1    0  1      sum    = a ^ b
  1  0    0  1      carry  = a & b
  1  1    1  0

  uint64_t carry = input;
  uint64_t *p = buffer;
  while (carry) {
    uint64_t a = *p, b = carry;
    *p++ = a ^ b;            // sum bits stay in this word
    carry = a & b;           // carry bits ripple into the next word
  }

For 64-bit words the loop will run 6-7 times on average -- the number of iterations is determined by the longest chain of carries.
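
A possible way to wire this up for the question (a rough, untested sketch; the width `K`, the `Extract` step and the buffer layout are my own illustrative choices):

#include <stdint.h>

#define K 20                     // counter width in bits: supports up to 2^20 - 1 calls
uint64_t buffer[K];              // buffer[j] holds bit j of all 64 vertical counters
uint64_t bit_counter[64];

void Count(uint64_t bits) {
    uint64_t carry = bits;       // one increment per set bit position
    for (int j = 0; j < K && carry; ++j) {
        uint64_t a = buffer[j];
        buffer[j] = a ^ carry;   // sum bits
        carry = a & carry;       // carries ripple into the next word
    }
}

void Extract(void) {             // convert the vertical counters to ordinary counts
    for (int i = 0; i < 64; ++i) {
        uint64_t c = 0;
        for (int j = 0; j < K; ++j)
            c |= ((buffer[j] >> i) & 1) << j;
        bit_counter[i] += c;
    }
    // clear buffer[] here if accumulation is to restart
}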

Igor Skochinsky
  • Is this a transpose of a 64-element x 64-bit block or something? Or a transpose of 1 x 64-bit element to 64 x 1-bit elements? But `buffer[]` holds `long`s, not bits. What do you do with the result? – Peter Cordes Apr 30 '18 at 07:01
  • @PeterCordes you would have to extract the final counter from the vertical bit column. should be doable with a short loop of bit shifts, masking and OR operations. – Igor Skochinsky Apr 30 '18 at 07:11
  • It's been a while since I last looked into this problem but IIRC the advantage here is that the add step is much faster than the traditional array of counters, especially if some bits are zero. – Igor Skochinsky Apr 30 '18 at 15:44
  • Oh, I see, so `uint64_t buffer[64]` is a data structure for accumulating counts in each bit position separately, doing a vertical full-adder manually. At the end you loop through it once to get the actual counts for each bit position. You could use SIMD for this, running 2x or 4x 64-bit accumulators in parallel (on Sandybridge, probably 128-bit AVX is the best bet). You'd want to peel the first 4 to 6 or so iterations and do them purely in registers with no early-termination check. Maybe peel enough so you're usually done with no branch mispredict or loop. – Peter Cordes Apr 30 '18 at 16:11
3

You can unroll your function like this. It is probably faster than what your compiler can do!

//   rax as 64-bit input
   xor  rcx, rcx                //clear addend

   add  rax, rax                //Copy bit 63 to carry flag
   adc  dword ptr [@bit_counter + 63 * 4], ecx    //Add carry bit to counter[63]

   add  rax, rax                //Copy bit 62 to carry flag
   adc  dword ptr [@bit_counter + 62 * 4], ecx    //Add carry bit to counter[62]

   add  rax, rax                //Copy bit 61 to carry flag
   adc  dword ptr [@bit_counter + 61 * 4], ecx    //Add carry bit to counter[61]
//   ...
   add  rax, rax                //Copy bit 1 to carry flag
   adc  dword ptr [@bit_counter + 1 * 4], ecx     //Add carry bit to counter[1]

   add  rax, rax                //Copy bit 0 to carry flag
   adc  dword ptr [@bit_counter], ecx             //Add carry bit to counter[0]

EDIT:

You can also try a double increment, like this:

//   rax as 64-bit input
   xor  rcx, rcx                //clear addend
//
   add  rax, rax                //Copy bit 63 to carry flag
   rcl  rcx, 33                 //Move carry into bit 32 (bit 0 of the upper counter)
   add  rax, rax                //Copy bit 62 to carry flag
   adc  qword ptr [@bit_counter + 62 * 4], rcx  //Add increments for counters 63 and 62

   xor  rcx, rcx                //clear addend again, before the next pair of adds
   add  rax, rax                //Copy bit 61 to carry flag
   rcl  rcx, 33                 //Move carry into bit 32
   add  rax, rax                //Copy bit 60 to carry flag
   adc  qword ptr [@bit_counter + 60 * 4], rcx  //Add increments for counters 61 and 60
//...
//...
GJ.
  • This probably will not be much faster, and not worth the maintenance effort... –  Oct 18 '11 at 02:50
  • @Colt 45: probably? The second solution is probably the fastest! Only harold's solution can be faster. – GJ. Oct 18 '11 at 08:52
  • Can you back that up with timings? The OP's code probably is compiled into something nearly like this is my point. It may be faster, but not by orders of magnitude. –  Oct 18 '11 at 16:42
  • @Colt 45: my second routine should be about 50% faster! – GJ. Oct 18 '11 at 17:24
  • Your first version looks good, but what CPU did you test the `rcl` version on? According to http://agner.org/optimize/, `rcl r,imm` (i.e. a count other than 1) is quite slow. Like 8 uops on Intel Sandybridge-family (with 1 per 6c throughput), or 9 uops on AMD K10 (1 per 3 cycles). Or 16 uops on Bulldozer-family / 7 cycle latency. – Peter Cordes Apr 29 '18 at 12:37
2

You could use a set of counters, each of a different size. First accumulate 3 values in 2-bit counters, then unpack them and update 4-bit counters. When 15 values are ready, unpack to byte-sized counters, and after 255 values update bit_counter[].

All this work can be done in parallel in 128-bit SSE registers. On modern processors only one instruction is needed to unpack 1 bit to 2: just carry-less-multiply the source quadword by itself with the PCLMULQDQ instruction, which interleaves the source bits with zeros. The same trick can help to unpack 2 bits to 4, and unpacking 4 and 8 bits may be done with shuffles, unpacks and simple logical operations.
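
For instance, the 1-bit-to-2-bit unpack (carry-less squaring) could look roughly like this with intrinsics; `spread_bits_32` is a name I made up, and you'd need to compile with `-mpclmul` (the instruction is available on Sandy Bridge):

#include <stdint.h>
#include <emmintrin.h>   // SSE2
#include <wmmintrin.h>   // PCLMULQDQ

// Interleave the low 32 bits of x with zeros: bit i of x lands at bit 2*i.
// Carry-less multiplication of a value by itself squares it in GF(2),
// which doubles every bit position.
static inline uint64_t spread_bits_32(uint32_t x)
{
    __m128i v  = _mm_cvtsi64_si128((int64_t)x);
    __m128i sq = _mm_clmulepi64_si128(v, v, 0x00);  // low qword * low qword
    return (uint64_t)_mm_cvtsi128_si64(sq);
}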

Average performance seems to be good, but the price is 120 bytes for additional counters and quite a lot of assembly code.

Evgeny Kluev
1

There's no way to answer this in general; it all depends on the compiler and the underlying architecture. The only real way to know is to try different solutions, and measure. (On some machines, for example, shifts can be very expensive. On others, no.) For starters, I'd use something like:

uint64_t mask = 1;
int index = 0;
while ( mask != 0 ) {
    if ( (bits & mask) != 0 ) {
        ++ bit_counter[index];
    }
    ++ index;
    mask <<= 1;
}

Unrolling the loop completely will likely improve performance. Depending on the architecture, replacing the if with:

bit_counter[index] += ((bits & mask) != 0);

might be better. Or worse... it's impossible to know in advance. It's also possible that on some machines, systematically shifting into the low order bit and masking, as you are doing, would be best.

Some optimizations will also depend on what typical data looks like. If most of the words only have one or two bits set, you might gain by testing a byte at a time, or four bits at a time, and skipping those that are all zeros completely.
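
A tiny sketch of that last idea (the `Count_sparse` name is mine; as noted, it only pays off when set bits are sparse, which is not the OP's ~50% density case):

#include <stdint.h>

uint32_t bit_counter[64];

void Count_sparse(uint64_t bits) {
    for (int byte = 0; byte < 8; ++byte) {
        unsigned b = (unsigned)(bits >> (8 * byte)) & 0xFF;
        if (b == 0) continue;                 // skip all-zero bytes entirely
        for (int i = 0; i < 8; ++i)
            bit_counter[8 * byte + i] += (b >> i) & 1;
    }
}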

James Kanze
  • Branching in your example is much more costly than the original code. – Łukasz Lew Oct 17 '11 at 13:26
  • @ŁukaszLew Maybe. That's why I presented several different solutions, with the recommendation to measure them. Different processors take different times for different operations. (I've used processors where shifting multiple bits was very slow, for example.) – James Kanze Oct 17 '11 at 15:39
  • @ŁukaszLew it all depends on the optimizer; compilers have become very complex nowadays. Presented implementations are -at least - compact, straightforward and maintainable. More effort sometimes decreases the catch, unless you exactly know what you are doing and turn off all options, programming as low level and platform specific as it gets. Such sort of hand optimizations are normally supposed to take place very late in the production schedule. – Red.Wave Apr 30 '18 at 17:34
1

If you count how often each nibble (16 possibilities) occurs at each offset (16 possibilities), you can easily sum the results. And those 256 sums are easily kept:

#include <stdint.h>

unsigned long nibble_count[16][16]; // E.g. 0x000700B0 corresponds to [4][7] and [1][B]
unsigned long bitcount[64];

void CountNibbles(uint64_t bits) {
  // Count nibbles
  for (int i = 0; i != 16; ++i) {
     nibble_count[i][bits & 0xf]++;
     bits >>= 4;
  }
}
void SumNibbles() {
  for (int i = 0; i != 16; ++i) {
    for (int nibble = 0; nibble != 16; ++nibble) {
      for (int bitpos = 0; bitpos != 4; ++bitpos) {
        if (nibble & (1 << bitpos)) {
          bitcount[i*4 + bitpos] += nibble_count[i][nibble];
        }
      }
    }
  }
}
MSalters
  • why not go all the way and store count per byte? e.g `byte_count[8][256]`? Then your loop unrolls quite nicely. Should have the same impact on the cache... – Nim Oct 17 '11 at 16:24
  • @Nim: Not entirely; `byte_count` has 8 times more elements. Since it's accessed at random, that's quite unfortunate for the L1 cache. But profiling should quickly show what's most efficient. And, of course, you can also profile `quintet_count[13][32]` etc ;) – MSalters Oct 18 '11 at 06:57
  • You probably don't need 64-bit counters, so you could cut your cache footprint in half. You're already using `uint64_t`, so you could use `uint32_t nibble_count` instead of `unsigned long` which is 32-bit on x86-64 Windows, but 64-bit on other x86-64 OSes. – Peter Cordes Apr 29 '18 at 12:51
0

This is fairly fast:

void count(uint_fast64_t bits){
    uint_fast64_t i64 = ffs64(bits);   // 1-based index of the lowest set bit, 0 if none
    while(i64){
        bit_counter[i64-1]++;
        bits &= bits - 1;              // clear the lowest set bit
        i64 = ffs64(bits);
    }
}

You need a fast implementation of ffs for 64 bits. For most compilers and CPUs this is a single instruction. The loop is executed once for each set bit in the word, so bits = 0 will be very fast and a word with all 64 bits set will be slowest.
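
`ffs64` is not a standard function; on GCC/Clang one possible definition (my assumption of what was used) is a thin wrapper around a builtin:

#include <stdint.h>

// Returns the 1-based index of the lowest set bit, or 0 if x == 0.
// __builtin_ffsll compiles to a single BSF/TZCNT-based sequence on x86-64.
static inline int ffs64(uint64_t x) {
    return __builtin_ffsll((long long)x);
}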

I tested this under 64 bit Ubuntu with GCC, and it produces the same data output as your original:

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

The speed is variable based on the number of 1 bits in the 64 bit word.

the wolf
  • +1 for effort, best alternative here, but this won't be as fast as his original... –  Oct 18 '11 at 02:49