7

I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

int countBit1Fast(int n)
{
    int c = 0;
    for (; n; ++c)
        n &= n - 1;
    return c;
}

But a friend told me that `__builtin_popcount(int)` is a lot faster, though less portable. I gave it a try and it was MANY times faster! Why is it so fast? I want to count bits as fast as possible, but without sticking to a particular compiler.

EDIT: I may use it on PIC microcontrollers and maybe on non-Intel processors, so I need maximum portability.

  • 4
    Because it is using HW instruction – Eugene Sh. Jul 17 '18 at 18:22
  • What's HW instruction? –  Jul 17 '18 at 18:23
  • It's assembler? –  Jul 17 '18 at 18:25
  • 3
    HW = hardware instruction – Dillon Davis Jul 17 '18 at 18:27
  • 1
    You can use the Intel intrinsic variant instead if you want something portable across other compilers: `_mm_popcnt_u64()`, and then include `nmmintrin.h`. – Dillon Davis Jul 17 '18 at 18:28
  • I may use it on a PIC and maybe on non-intel processors. It will still work? –  Jul 17 '18 at 18:31
  • I'm not certain of the non-intel processors part, but position independent code is not an issue. I will check on the non-intel variants – Dillon Davis Jul 17 '18 at 18:33
  • No, @TheCrow. It's not even very portable across Intel-based platforms. For instance, although I was a bit surprised to find that my GCC in fact recognized `nmmintrin.h` at all, it did not successfully compile even a trivial program that included that header. – John Bollinger Jul 17 '18 at 18:33
  • @JohnBollinger I was able to reproduce that issue, but discovered the problem- the intrinsics are specified to always inline, but unless you specify an architecture the inlining fails. Use the compiler flag `-march=native` for example, to get it to compile. – Dillon Davis Jul 17 '18 at 18:44
  • @DillonDavis, which works for GCC specifically, at least for architectures that are actually supported. I'm not seeing how that does much to bolster the portability assertion. – John Bollinger Jul 17 '18 at 18:47
  • @JohnBollinger in the original question, OP only questions portability in regards to which compiler is used. If it also needs to be portable in regards to target architecture, then intel intrinsics would not be the way to go. – Dillon Davis Jul 17 '18 at 18:52
  • That's right, I edited my question, now it's focused on portability. –  Jul 17 '18 at 18:57
  • @DillonDavis, I'm saying that "it works if you can figure out the right compiler options" is not a strong argument for portability. – John Bollinger Jul 17 '18 at 19:32
  • See also [this question](https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer) for additional techniques. – Steve Summit Jul 18 '18 at 03:21
  • 4
    @JohnBollinger you don't need to worry about portability on gcc/clang targets, just use `__builtin_popcount` and it'll use the bit count instruction if available, otherwise it'll use [some optimized procedure for that architecture](https://godbolt.org/g/vkjZm4) – phuclv Jul 26 '18 at 01:04

4 Answers

6

> I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

I don't see why anyone would characterize your approach as "very fast". It's a bit clever, and it should be faster on average than naive alternatives. It also does not depend on the width of the representation of int, which is a plus. I observe that it has undefined behavior for negative arguments, but that's a common theme for bitwise operators and functions.

Let's analyze, supposing a non-negative argument:

int c = 0;
for (; n; ++c)
    n &= n - 1;
  • How many loop iterations are performed?

    1 for each 1 bit in the binary representation of the value, irrespective of where in the value each bit lies (see the short trace after this list)

  • How much work is performed per iteration?

    • one increment of c
    • one comparison of n against zero (plus one more of these when breaking out of the loop)
    • one decrement of n by 1
    • one bitwise 'and'

    That ignores reads and stores, which very likely can be made free or especially cheap by keeping the operands in registers. If we assume equal cost for each of those, that's four operations per iteration. For random 32-bit integers, there will be an average of 16 iterations, for a total of 65 operations on average. (Best case is just one operation, but worst is 129, which is no better than a naive implementation).
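
To see why the iteration count equals the number of 1 bits, here is a short trace (an illustrative example, not from the answer) for n = 0b101100; each `n &= n - 1` clears exactly the lowest set bit:

/* n = 101100, n-1 = 101011, n & (n-1) = 101000   (lowest 1-bit cleared, c = 1)
 * n = 101000, n-1 = 100111, n & (n-1) = 100000   (c = 2)
 * n = 100000, n-1 = 011111, n & (n-1) = 000000   (c = 3; loop exits, returns 3)
 */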

__builtin_popcount(), on the other hand, uses a single instruction regardless of input on platforms that support it, which yours very likely is. Even on platforms without a dedicated instruction, however, the count can be computed faster (on average) than your loop does it.
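
For what it's worth, a minimal sketch (not part of this answer) of keeping the builtin behind a wrapper with a portable fallback, assuming a GCC- or Clang-compatible compiler; the name popcount32 is made up for illustration:

#include <stdint.h>

/* Use the GCC/Clang builtin where available, otherwise fall back to
 * the loop from the question. */
static inline int popcount32(uint32_t n)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(n);
#else
    int c = 0;
    for (; n; ++c)
        n &= n - 1;
    return c;
#endif
}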

@dbush has presented one such mechanism that has similar advantages to the one you present. In particular, it does not depend on a pre-chosen integer width, and although it does depend on where in the representation the 1 bits reside, it does run faster for some arguments (smaller ones) than others. If I'm counting right, that one will average around 20 operations on random 32-bit inputs: five in each of four loop iterations (only 0.4% of random inputs would require fewer than four iterations). I'm counting one table read per iteration there, which I assume can be served from cache, but which is probably still not as fast as an arithmetic operation on values already held in registers.

One that is strictly computational would be:

#include <stdint.h>

int countBit1Fast(uint32_t n) {
    /* Each line sums adjacent groups of bits, doubling the group width:
       pairs, nibbles, bytes, 16-bit halves, then the whole word. */
    n = (n & 0x55555555u) + ((n >> 1) & 0x55555555u);
    n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
    n = (n & 0x0f0f0f0fu) + ((n >> 4) & 0x0f0f0f0fu);
    n = (n & 0x00ff00ffu) + ((n >> 8) & 0x00ff00ffu);
    n = (n & 0x0000ffffu) + ((n >>16) & 0x0000ffffu);
    return n;
}

That's pretty easy to count: five additions, five shifts, ten bitwise 'and' operations, and five loads of constants for a total of 25 operations for every input (and it goes up only to 30 for 64-bit inputs, though those are now 64-bit operations instead of 32-bit ones). This version is, however, intrinsically dependent on a particular size of the input data type.
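
For reference, the same pattern extended to 64-bit inputs (a sketch along the same lines, not part of the original answer; the name countBit1Fast64 is made up) just adds one more level:

#include <stdint.h>

int countBit1Fast64(uint64_t n) {
    n = (n & 0x5555555555555555u) + ((n >>  1) & 0x5555555555555555u);
    n = (n & 0x3333333333333333u) + ((n >>  2) & 0x3333333333333333u);
    n = (n & 0x0f0f0f0f0f0f0f0fu) + ((n >>  4) & 0x0f0f0f0f0f0f0f0fu);
    n = (n & 0x00ff00ff00ff00ffu) + ((n >>  8) & 0x00ff00ff00ff00ffu);
    n = (n & 0x0000ffff0000ffffu) + ((n >> 16) & 0x0000ffff0000ffffu);
    n = (n & 0x00000000ffffffffu) + ((n >> 32) & 0x00000000ffffffffu);
    return (int)n;
}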

John Bollinger
4

As others have mentioned, `__builtin_popcount()` is fast because it uses a single x86 instruction.

If you want something faster than what you have that doesn't use anything processor or compiler specific you can create a lookup table with 256 entries:

int bitcount[] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};

Then use that to get the bit count of each byte:

int countBit1Fast(int n)
{
    int i, count = 0;
    unsigned char *ptr = (unsigned char *)&n;   /* view n as an array of bytes */

    for (i = 0; i < (int)sizeof(int); i++) {
        count += bitcount[ptr[i]];              /* add the table entry for each byte */
    }
    return count;
}
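
If typing out the table is a concern, a possible alternative (a sketch, not part of this answer) is to fill it once at startup using the recurrence bitcount[i] = bitcount[i >> 1] + (i & 1), replacing the literal initializer above:

static int bitcount[256];

/* Fill the per-byte table once, e.g. early in main(). Shifting i right
 * drops its low bit, so the count for i is the count for i >> 1 plus
 * that low bit. */
static void init_bitcount(void)
{
    int i;
    bitcount[0] = 0;
    for (i = 1; i < 256; i++)
        bitcount[i] = bitcount[i >> 1] + (i & 1);
}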
dbush
  • Your implementation is very good, it's faster than mine for small inputs (which is almost all cases). Thanks! –  Jul 17 '18 at 19:55
1

As others have mentioned, on x86_64 you have a popcount CPU instruction that will thoroughly trounce any software implementation.

In the absence of a CPU popcount instruction, which method is fastest depends on word size, lookup speed (which may depend on CPU cache behaviour), and the effectiveness of super-scalar pipelining.

The simple approach of taking each byte, looking it up in a table, and adding the values together is quite quick, taking about ceil(num_bits/8)*3 - 1 operations, depending on how "array fetch" works.

There's another less well known method that works by grouping bits into runs, then repeatedly creating half as many runs that are twice the size as before, where each run contains the sum of two previous runs.

This algorithm takes 4×log₂(num_bits) - 1 operations, which means it performs comparatively poorly for small integer sizes, but improves for larger ones:

| int size (bits) | ops (lookup) | ops (run-add) |
|----------------:|-------------:|--------------:|
|   8 |  2 | 11 |
|  16 |  5 | 15 |
|  32 | 11 | 19 |
|  64 | 23 | 23 |
| 128 | 47 | 27 |
| 256 | 95 | 31 |

Initially you start with every bit in its own run; then you take pairs of runs and add them together, so each result is a number from 0 to 2 inclusive, which conveniently fits in a 2-bit field:

x = (x >> 1 & 0x55555555555555555555555555555555)
   +(x      & 0x55555555555555555555555555555555);

Now every pair of bits contains a number from 0 to 2, indicating how many bits used to be set in that pair. The subsequent steps are then fairly straightforward: combine adjacent runs into new runs that are twice the width:

x = (x >> 2 & 0x33333333333333333333333333333333)
   +(x      & 0x33333333333333333333333333333333);

Now each run of 4 bits contains a number from 0 to 4. Since those numbers fit in 3 bits, the top bit of each run will always be 0, and doesn't need to be included in the mask.

x = (x >> 4 & 0x07070707070707070707070707070707)
   +(x      & 0x07070707070707070707070707070707);

Now each run of 8 bits contains a number from 0 to 8. Since those numbers fit in 4 bits, the top 12 bits of each new 16-bit run will always be 0, and don't need to be included in the mask.

x = (x >> 8 & 0x000f000f000f000f000f000f000f000f)
   +(x      & 0x000f000f000f000f000f000f000f000f);

Now each run of 16 bits contains a number from 0 to 16. Since those numbers fit in 5 bits, the top 27 bits of each new 32-bit run will always be 0, and don't need to be included in the mask.

x = (x >>16 & 0x0000001f0000001f0000001f0000001f)
   +(x      & 0x0000001f0000001f0000001f0000001f);

Now each run of 32 bits contains a number from 0 to 32. Since those numbers fit in 6 bits, the top 58 bits of each new 64-bit run will always be 0, and don't need to be included in the mask.

x = (x >>32 & 0x000000000000003f000000000000003f)
   +(x      & 0x000000000000003f000000000000003f);

Now each run of 64 bits contains a number from 0 to 64. Since those numbers fit in 7 bits, the top 121 bits of each new 128-bit run will always be 0, and don't need to be included in the mask.

x = (x >>64 & 0x0000000000000000000000000000007f)
   +(x      & 0x0000000000000000000000000000007f);

In general, for step i, pre-compute

w0 = 1 << i;        /* number of bits per run for THIS cycle */
w1 = 1 << (i + 1);  /* number of bits per run for NEXT cycle */
r1 = w1 - 1;        /* mask wide enough for a number from 0 .. w0 inclusive */

/* Create a pattern of bits with a 1 every w1 bits
   (UINTMAX stands for the all-ones value of the unsigned type in use): */
m1 = 1 << w1;
m3 = UINTMAX / (m1 - 1);

m4 = m3 * r1;       /* r1 repeated every w1 bits */

shift[i] = w0;
mask[i] = m4;

/* for the multiply variant below */
m0 = 1 << w0;
s_mult[i] = m0 - 1;

and then for each step use:

x = (x >> shift[i] & mask[i])
   +(x             & mask[i]);
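
As a sanity check of those formulas, here is one possible 64-bit version (a sketch, not from the answer; the name popcount_runs64 is made up) that computes the shift and mask for each step on the fly instead of from precomputed arrays:

#include <stdint.h>

int popcount_runs64(uint64_t x)
{
    int i;
    for (i = 0; i < 6; i++) {                    /* log2(64) = 6 doubling steps */
        uint64_t w0 = (uint64_t)1 << i;          /* run width this cycle */
        uint64_t w1 = (uint64_t)1 << (i + 1);    /* run width next cycle */
        uint64_t r1 = w1 - 1;                    /* mask for 0 .. w0 inclusive */
        /* 2^w1 - 1, computed so the final step (w1 == 64) doesn't overflow */
        uint64_t block = (w1 >= 64) ? UINT64_MAX
                                    : (((uint64_t)1 << w1) - 1);
        uint64_t m3   = UINT64_MAX / block;      /* a 1-bit every w1 bits */
        uint64_t mask = m3 * r1;                 /* r1 repeated every w1 bits */

        x = (x >> w0 & mask) + (x & mask);
    }
    return (int)x;
}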

Depending on how fast your CPU can do multiplication, this might make better use of pipelining:

x -=  x >> 1 & 0x55555555555555555555555555555555;
x -= (x >> 2 & 0x33333333333333333333333333333333) * 3;
x -= (x >> 4 & 0x07070707070707070707070707070707) * 0xf;
x -= (x >> 8 & 0x000f000f000f000f000f000f000f000f) * 0xff;
x -= (x >>16 & 0x0000001f0000001f0000001f0000001f) * 0xffff;
x -= (x >>32 & 0x000000000000003f000000000000003f) * 0xffffffff;
x -= (x >>64 & 0x0000000000000000000000000000007f) * 0xffffffffffffffff;

or, written generically for step i:

x -= (x >> shift[i] & mask[i]) * s_mult[i];
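
Spelled out for 64-bit values (again just a sketch, not part of the answer, with the 128-bit constants above truncated to 64 bits; the name popcount_mult64 is made up), the multiply variant becomes:

#include <stdint.h>

int popcount_mult64(uint64_t x)
{
    x -=  x >>  1 & 0x5555555555555555u;
    x -= (x >>  2 & 0x3333333333333333u) * 3;
    x -= (x >>  4 & 0x0707070707070707u) * 0xf;
    x -= (x >>  8 & 0x000f000f000f000fu) * 0xff;
    x -= (x >> 16 & 0x0000001f0000001fu) * 0xffff;
    x -= (x >> 32 & 0x000000000000003fu) * 0xffffffffu;
    return (int)x;
}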
Martin Kealey
0

`__builtin_popcount(unsigned int)` is so fast because it is a GCC extension that uses a hardware instruction where one is available. If you are willing to trade architecture portability for compiler portability, look into the just-as-fast Intel intrinsic functions, specifically:

_mm_popcnt_u32(unsigned __int32);
_mm_popcnt_u64(unsigned __int64);

You must then include the <nmmintrin.h> header file to use these intrinsic functions; they will, however, work with non-GCC compilers. You may also have to specify a target architecture (for example with `-march=native`) so that the functions can be inlined, which they are required to be.
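
For example, a minimal usage sketch (assuming an x86 target and, for GCC/Clang, a flag such as `-march=native` or `-mpopcnt` as discussed in the comments):

#include <stdio.h>
#include <nmmintrin.h>   /* SSE4.2 intrinsics, including _mm_popcnt_u32 */

int main(void)
{
    unsigned int n = 0xF00Fu;
    /* Compiles to a single popcnt instruction on CPUs that support it. */
    printf("%d\n", _mm_popcnt_u32(n));   /* prints 8 */
    return 0;
}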

Seonghyeon Cho
Dillon Davis
  • 2
    `__builtin_popcount()` uses a hardware instruction on platforms that have a suitable one, at least, agreed. But I cannot agree with recommending `_mm_popcnt_u*()` as being particularly more portable. – John Bollinger Jul 17 '18 at 18:36
  • @JohnBollinger updated to mention the trade-off between compiler portability and architecture portability. – Dillon Davis Jul 17 '18 at 18:49
  • 3
    it only uses a hardware instruction if the hardware supports it; otherwise [it translates to an optimized function](https://godbolt.org/g/vkjZm4) – phuclv Jul 26 '18 at 01:04
  • 1
    Intel intrinsics `_mm_popcnt_u32` or `_popcnt32` are portable **across x86 compilers** (including MSVC). GNU C `__builtin_popcount` / `__builtin_popcountll` are portable **across architectures** on every compiler that supports the extension (including GCC and clang, and ICC). You always need `-march=`whatever to get a `popcnt` instruction on x86 where it's unfortunately not baseline, but `__builtin_popcount` will fall back to a bithack if HW popcnt isn't available. – Peter Cordes Nov 13 '22 at 06:48
  • If you're using C++20, finally there's a `std::popcount`, but ISO C still doesn't have a portable way to expose this common integer operation that most CPUs have instructions for. Unfortunately C is still stuck in the dark ages for portable integer operations like bit-scan and popcount. – Peter Cordes Nov 13 '22 at 06:48