1007

8 bits representing the number 7 look like this:

00000111

Three bits are set.

What are the algorithms to determine the number of set bits in a 32-bit integer?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Matt Howells
  • 40,310
  • 20
  • 83
  • 102
  • 123
    This is the Hamming weight BTW. – Purfideas Sep 20 '08 at 19:17
  • 13
    What's a real-world application for this? (This isn't to be taken as a criticism--I'm just curious.) – jonmorgan Dec 10 '10 at 20:59
  • 11
    Calculation of parity bit (look it up), which was used as simple error detection in communication. – Dialecticus Dec 11 '10 at 00:28
  • 8
    @Dialecticus, calculating a parity bit is [cheaper](http://www-graphics.stanford.edu/~seander/bithacks.html#ParityParallel) than calculating the Hamming weight – finnw May 12 '11 at 12:14
  • 18
    @spookyjon Let's say you have a graph represented as an adjacency matrix, which is essentially a bit set. If you want to calculate the number of edges of a vertex, it boils down to calculating the Hamming weight of one row in the bit set. – fuz Oct 10 '11 at 16:02
  • 2
    [US patent 6,516,330 – Counting set bits in data words](http://www.google.com/patents/US6516330) – Hot Licks Mar 09 '12 at 12:03
  • Here is a wiki link to algorithms: http://en.wikipedia.org/wiki/Hamming_weight – C0D3 Mar 22 '12 at 20:12
  • 1
    'Best' is not well-defined but would have to mean you can't even use a 256-entry * 3 bits lookup table. All these computational approaches will underperform using a simple 64K-entry (* 5 bits) lookup table on the hi- and low- 16 bits, and one addition. Or a 256-entry table and three additions. – smci Apr 07 '12 at 08:32
  • 2
    @jonmorgan When guessing the keylength of a XOR-crypted cipher, a naive version of this computation takes about 90% of the processing time. – toasted_flakes Aug 02 '13 at 23:30
  • Cyclic Redundancy Check ? – Sachin Gadagi Sep 30 '13 at 15:09
  • 2
    Application: you can easily count the number of flags set on a `[Flags()]` enum. – test Nov 11 '14 at 04:14
  • @jonmorgan There is a strange data structure in OpenType, where a flags byte defines what data is included in a record. The size of the record is 2 times the number of set bits. Here is a link if you are ready for a deep dive into font formats: https://learn.microsoft.com/en-us/typography/opentype/spec/gpos#value-record – Waruyama Mar 07 '21 at 12:26

65 Answers

983

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed; hardware popcount is normally fast if it exists at all).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.


Portable algorithms that don't need (or benefit from) any HW support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set bit with a bithack in a loop until the value becomes zero.
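
As a minimal sketch of that bithack: x & (x - 1) clears the lowest set bit, so the loop runs once per set bit rather than once per bit position:

#include <stdint.h>

int popcount_sparse(uint32_t x)   // fast when few bits are set
{
    int count = 0;
    while (x) {
        x &= x - 1;   // clear the lowest set bit
        count++;
    }
    return count;
}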

I believe a very good general-purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo-language; you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)

int numberOfSetBits(uint32_t i)
{
     // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
     i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
     return (i * 0x01010101) >> 24;          // horizontal sum of bytes
}

For JavaScript, coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)


How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).

The next step adds the odd and even ones of those 16x 2-bit accumulators pairwise, producing 8x 4-bit sums. The i - ... optimization isn't possible this time, so it just masks before and after shifting. Using the same 0x33... constant both times, instead of 0xcc... before shifting, is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4 (if all 4 of the corresponding input bits were set). 4+4 = 8, which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce a carry into the top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.
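
As a sketch, that 64-bit version (the same bithack with wider constants and the multiply-and-shift reduction) might look like this:

#include <stdint.h>

int numberOfSetBits64(uint64_t i)
{
    i = i - ((i >> 1) & 0x5555555555555555ULL);        // pairs
    i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL);  // quads
    i = (i + (i >> 4)) & 0x0F0F0F0F0F0F0F0FULL;        // groups of 8
    return (i * 0x0101010101010101ULL) >> 56;          // horizontal sum of bytes
}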


With full SIMD for wider vectors (e.g. counting a whole array)

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).
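
A minimal sketch of that nibble-LUT technique with SSSE3 intrinsics (the function name is mine; assumes a CPU with SSSE3):

#include <tmmintrin.h>   // SSSE3
#include <stdint.h>

static inline uint64_t popcount_16bytes(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo  = _mm_and_si128(v, mask);                     // low nibble of each byte
    __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  // high nibble of each byte
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),     // per-byte counts
                               _mm_shuffle_epi8(lut, hi));
    __m128i sad = _mm_sad_epu8(cnt, _mm_setzero_si128());     // sum bytes per 64-bit half
    return (uint64_t)(_mm_cvtsi128_si64(sad)
                    + _mm_cvtsi128_si64(_mm_unpackhi_epi64(sad, sad)));
}

When counting a whole array, you would keep the per-byte counts (or the PSADBW sums) in a vector accumulator inside the loop and reduce to a scalar once at the end.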

On Intel CPUs, the hardware 64-bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Matt Howells
  • 40,310
  • 20
  • 83
  • 102
  • 3
    +1. The first line in your NumberOfSetBits() is very cool -- just 3 instructions, instead of the 4 you would need if you separately masked out the even- and odd-numbered bits and added them (appropriately shifted) together. – j_random_hacker May 25 '09 at 12:23
  • 101
    ha! love the NumberOfSetBits() function, but good luck getting that through a code review. :-) – Jason S Nov 22 '09 at 06:51
  • 41
    Maybe it should use `unsigned int`, to easily show that it is free of any sign bit complications. Also would `uint32_t` be safer, as in, you get what you expect on all platforms? – Craig McQueen Dec 15 '09 at 02:18
  • I checked a few things while answering to this question http://stackoverflow.com/questions/2709430/count-number-of-bits-in-a-64-bit-long-big-integer/2709523#2709523 and I think there is a bug. last line: return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; should be changed to: return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24; (first the sum and bitwise & later) – Maciej Hehl Apr 25 '10 at 19:56
  • @Maciej: Can you please provide a case where it does not return the expected result? – Matt Howells Apr 28 '10 at 10:04
  • 3
    @Matt Howells I'm sorry. It looks like I got lost among those parenthesis. I had a bug in my implementation. it didn't work for numbers > 15. I checked the wikipedia article and there were those parenthesis. I was sure that was my problem. I fixed the parenthesis and it started to work. Looks like I fixed something else. But I think I learned something. Inability to reproduce the "bug" made me check the precedence of the operators :). Thank You and I apologize for the mess – Maciej Hehl Apr 29 '10 at 14:19
  • Is there a portable version of this code? For instance, how does it behave with 9 bit bytes or other unusual architectures? – Robert S. Barnes Mar 29 '11 at 07:54
  • 39
    @nonnb: Actually, as written, the code is buggy and needs maintenance. `>>` is implementation-defined for negative values. The argument needs to be changed (or cast) to `unsigned`, and since the code is 32-bit-specific, it should probably be using `uint32_t`. – R.. GitHub STOP HELPING ICE May 14 '11 at 21:55
  • 1
    I took the “write-only” bit, true as it was, as a challenge, and set to decode the code. I got about halfway; I'm not sure exactly what the additions and multiplication do in the context of this function. I've edited the answer to include the explanation as far as I could get it; I invite anyone smarter than me to finish it. – Peter Hosey Aug 04 '11 at 05:36
  • @Peter Hosey: Sorry, but I feel your comments in the code don't add much value and perhaps the explanation would be better in a comment. – Matt Howells Aug 22 '11 at 08:58
  • 7
    It's not really magic. It's adding sets of bits but doing so with some clever optimizations. The wikipedia link given in the answer does a good job of explaining what's going on but I'll go line by line. 1) Count up the number of bits in every pair of bits, putting that count in that pair of bits (you'll have 00, 01, or 10); the "clever" bit here is the subtract that avoids one mask. 2) Add pairs of those sums of bitpairs into their corresponding nibbles; nothing clever here but each nibble will now have a value 0-4. (cont'd) – dash-tom-bang Dec 05 '12 at 00:42
  • 4
    3) does too much on one line, but up to the mask, the nibbles are added up into bytes, then the multiply adds all of the bytes into the high byte, which is then shifted down, leaving the result. – dash-tom-bang Dec 05 '12 at 00:43
  • 8
    Another note, this extends to 64 and 128 bit registers by simply extending the constants appropriately. Interestingly (to me), those constants are also ~0 / 3, 5, 17, and 255; the former three being 2^n+1. This all makes more sense the more you stare at it and think about it in the shower. :) – dash-tom-bang Dec 05 '12 at 00:48
  • 3
    Another note, there's complaints above about >> extension being unspecified for negative ints; however, all of the results from >> are masked in such a way as to discard the extended bits .. except for the final >>, which is OK since the upper 2 bits of its input are mathematically guaranteed to be zero. This may apply to the Java concern as well. – greggo Dec 14 '12 at 14:26
  • What if our input is a byte? I don't believe it's working for me when I cast my input from a byte to an int. – fIwJlxSzApHEZIl Feb 09 '13 at 02:42
  • 2
    The solution by bcdabcd987 is the same algorithm before being optimized to the point of becoming illegible... – Lefteris E Jun 13 '13 at 10:07
  • 2
    http://www.mpi-inf.mpg.de/departments/rg1/teaching/advancedc-ws08/script/lecture10.pdf – Lefteris E Jun 13 '13 at 10:15
  • 2
    For an explanation of this algorithm, refer to pages 179-180 of [Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors](http://support.amd.com/TechDocs/25112.PDF). – Jingguo Yao Dec 25 '13 at 05:53
  • 1
    @R: greggo is correct. If ones are shifted in then they are masked away. The final shift follows a multiplication summing every byte into the high byte. Since the high byte is summing bits set, it will never exceed 32 and only requires at most 6 bits. So the high bit will never be set and the number will never be negative. – Apriori Jun 08 '14 at 20:45
  • 1
    I find it weird that [this answer](http://stackoverflow.com/a/21114060/645270) isn't closer to this one in terms of upvotes. Performance-wise it seems really stable to depend on the number of bits set. Maybe I misunderstood just how fast this one can be. – keyser Feb 06 '15 at 16:34
  • 4
    For Java you can just call [`Integer.bitCount(int)`](http://docs.oracle.com/javase/8/docs/api/java/lang/Integer.html#bitCount-int-) (since 1.5 but that's a while back already). – Maarten Bodewes Mar 18 '15 at 18:42
  • You can also generate a neat lookup table using variadic templates in C++. Still doesn't perform as well as the builtin though. https://bitbucket.org/ldiamond/popcount/src – Lewis Diamond May 30 '16 at 17:06
  • 3
    @JasonS If you cite where you get it from, for example Numerical Recipes or Art of Programming, and someone still denies it from code review as unmaintainable, that person shouldn't be reviewing code. There are some bibles/oracles we simply trust as sources of good algorithms so we don't go reinventing things ourselves. – corsiKa Jul 22 '16 at 19:39
  • 3
    Even with citations, that doesn't mean the code has been copied correctly, or that the original code did not contain errors for edge cases. – Jason S Jul 23 '16 at 19:56
  • 3
    It's not hard to write a test suite that just goes through all 2**32 possible inputs and compares to a transparent reference implementation. Given that there are no loops or conditionals, no variable shifts (and thus no room for UD behavior, if you are using unsigned) that counts as proof. Not so much for the 64-bit version though. Incidentally, __builtin_popcount() (gcc,clang) will generate something very much like this when there is no popcount instruction. – greggo Mar 22 '17 at 15:33
  • Never use shift operators and masks on signed quantities. – Albert van der Horst Mar 05 '19 at 10:52
  • 1
    If implementing this in Javascript, change the first line to `i = (i|0) - ((i >> 1) & 0x55555555);` The `|0` coerces the value to an int32, giving a significant speed-up. Also using >>> over >> seems slightly faster (although as previous comments have pointed out, both will in fact work correctly in this algorithm). See performance tests here: https://jsperf.com/fast-int32-bit-count – Doin Nov 09 '19 at 15:42
  • @greggo: Even better, GCC10 and clang 10.0 both compile this to a `popcnt` instruction when targeting x86. So assuming their idiom recognizers for this are non-buggy, that's rock-solid proof of correctness. – Peter Cordes Aug 13 '21 at 05:38
  • @PeterCordes - I think it is also worth adding: `int count = 0; while (n) { count += n & 1; n >>= 1; } return count;` – Eriss69 Sep 09 '22 at 21:04
  • @Eriss69: There already are answers with that simple but naive method, such as [this one](https://stackoverflow.com/questions/109023/count-the-number-of-set-bits-in-a-32-bit-integer/67619076#67619076) which also shows one that loops only over set bits. – Peter Cordes Sep 10 '22 at 01:34
251

Some languages portably expose the operation in a way that can use efficient hardware support if available, otherwise some library fallback that's hopefully decent.

For example (from a table by language):

  • C++ has std::bitset<>::count(), or C++20 std::popcount(T x)
  • Java has java.lang.Integer.bitCount() (also for Long or BigInteger)
  • C# has System.Numerics.BitOperations.PopCount()
  • Python has int.bit_count() (since 3.10)

Not all compilers / libraries actually manage to use HW support when it's available, though. (Notably MSVC: even with options that make std::popcount inline as x86 popcnt, its std::bitset::count still always uses a lookup table. This will hopefully change in future versions.)

Also consider the built-in functions of your compiler when the portable language doesn't have this basic bit operation. In GNU C for example:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case (no single-instruction HW support) the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack like this answer, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a * or / operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.) Or even better, if the operand is a compile-time constant after inlining, it can do constant-propagation to get a compile-time-constant popcount result.
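
For instance, a minimal wrapper (the name popcount32 is mine) that lets the compiler pick the best available implementation:

#include <stdint.h>

// With -mpopcnt this inlines to a single popcnt instruction;
// without it, GCC falls back to a libgcc helper or the SWAR bithack.
static inline unsigned popcount32(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}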

The GCC builtins even work across multiple platforms. Popcount has almost become mainstream in the x86 architecture, so it makes sense to start using the builtin now so you can recompile to let it inline a hardware instruction when you compile with -mpopcnt or something that includes that (e.g. https://godbolt.org/z/Ma5e5a). Other architectures have had popcount for years, but in the x86 world there are still some ancient Core 2 and similar vintage AMD CPUs in use.


On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt (also implied by -msse4.2). See GCC x86 options. -march=nehalem -mtune=skylake (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.


Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and its std::bitset<>::count always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, which in turn implies the popcnt feature). (Update: see below; that does get MSVC's C++20 std::popcount to use x86 popcnt, but still not its bitset<>::count. MSVC could fix that by updating their standard library headers to use std::popcount when available.)

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret

unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax        # unnecessary 64-bit operand size
    ret

unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

This source isn't x86-specific or GNU-specific at all, but only compiles well with gcc/clang/icc, at least when targeting x86 (including x86-64).

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

C++20 has std::popcount(T)

Current libstdc++ headers unfortunately define it with a special case if(x==0) return 0; at the start, which clang doesn't optimize away when compiling for x86:

#include <bit>
int bar(unsigned x) {
    return std::popcount(x);
}

clang 11.0.1 -O3 -std=gnu++20 -march=nehalem (https://godbolt.org/z/arMe5a)

# clang 11
    bar(unsigned int):                                # @bar(unsigned int)
        popcnt  eax, edi
        cmove   eax, edi         # redundant: if popcnt result is 0, return the original 0 instead of the popcnt-generated 0...
        ret

But GCC compiles nicely:

# gcc 10
        xor     eax, eax         # break false dependency on Intel SnB-family before Ice Lake.
        popcnt  eax, edi
        ret

Even MSVC does well with it, as long as you use -arch:AVX or later (and enable C++20 with -std:c++latest). https://godbolt.org/z/7K4Gef

int bar(unsigned int) PROC                                 ; bar, COMDAT
        popcnt  eax, ecx
        ret     0
int bar(unsigned int) ENDP                                 ; bar
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Nils Pipenbrinck
  • 83,631
  • 31
  • 151
  • 221
  • 5
    I agree that this is good practice in general, but on XCode/OSX/Intel I found it to generate slower code than most of the suggestions posted here. See my answer for details. –  Sep 25 '08 at 03:29
  • AFAIK the only x86 CPU that could do a pop-count in a single instruction would be the AMD Phenom/Barcelona (Family 10h). has about a latency of 4 cycles or so? – Calyth Jan 09 '09 at 23:38
  • 5
    The Intel i5/i7 has the SSE4 instruction POPCNT which does it, using general purpose registers. GCC on my system does not emit that instruction using this intrinsic, i guess because of no -march=nehalem option yet. – matja Nov 24 '09 at 10:31
  • 3
    @matja, my GCC 4.4.1 emits the popcnt instruction if I compile with -msse4.2 – Nils Pipenbrinck Nov 24 '09 at 13:29
  • 76
    use c++'s `std::bitset::count`. after inlining this compiles to a single `__builtin_popcount` call. – deft_code Sep 04 '10 at 18:18
  • the intrinsic mentioned (_popcnt32/64) is located in immintrin.h and available if one has the POPCNT CPUID Feature Flag. It's not really part of SSE --at least that is my interpretation of the information in the Intrinsics Guide 3.0.1 provided by Intel) – nlucaroni Jul 24 '13 at 17:11
  • 1
    @nlucaroni Well, yes. Times are changing. I wrote this answer in 2008. Nowadays we have native popcount and the intrinsic will compile down to a single assembler statement if the platform allows that. – Nils Pipenbrinck Jul 24 '13 at 19:45
  • Unfortunately, call/return can be costly, so for inner loops I would prefer an inline version of Matt's NumberOfSetBits(). – Michael Dec 24 '15 at 16:46
  • 1
    @Michael: `__builtin` functions aren't real functions that get called. If compiling for a target that supports it as a single instruction (e.g. x86 with `-mpopcnt`), it inlines to just that. However, without that gcc may emit a call to a libgcc.a function like `__popcountdi2` instead of inlining those instructions. It's up to the compiler, though; clang4.0 chooses to inline, just like it does for `std::bitset.count()`. https://godbolt.org/g/TueMQt – Peter Cordes Aug 06 '17 at 19:04
  • 1
    Just to update the "state of `popcnt` instruction performance" mentioned in some of the above comments, the last several generations of Intel CPUs have been able to issue 1 `popcnt` per cycle, with a latency of 3 cycles, and AMD Zen architecture can issue 4 (!!) `popcnt` instructions per cycle with a latency of 1 cycle. So we can say that `popcnt` is "really fast" on modern hardware, and in AMD's case as fast as trivial instructions like `or` and `add`. – BeeOnRope Aug 06 '17 at 20:25
  • I just noticed that C++-20 now has `std::popcount()` here: https://en.cppreference.com/w/cpp/numeric/popcount However, I prefer this answer because it also handles signed values. `:)` – kevinarpe May 03 '20 at 11:12
  • You might also see this talk by Matt Godbolt, which touches on exactly the same topic: https://youtu.be/bSkpMdDe4g4?t=2378 – Thomas Nov 12 '20 at 13:45
  • 1
    @Thomas: Yup, I've seen it thanks. I left a comment on the video years ago (https://www.youtube.com/watch?v=bSkpMdDe4g4&lc=UgzcqUg6XD68lgitlKR4AaABAg), and linked it in prominently in [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116). But yes, highly recommend it to future readers. – Peter Cordes Nov 12 '20 at 19:42
  • 1
    I was rather disappointed to find that Python's `bit_count` wasn't introduced until 3.10. I'm stuck at 3.8 so I can't use it. – Mark Ransom Oct 31 '21 at 06:33
211

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

These rely on specific data type sizes so they're not that portable. But, since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953
  • 23
    Instead of dividing by 2 and commenting it as "shift bits...", you should just use the shift operator (>>) and leave out the comment. – indiv Sep 25 '08 at 03:42
  • 12
    wouldn't it make more sense to replace `if ((value & 1) == 1) { count++; }` with `count += value & 1`? – Ponkadoodle Apr 25 '10 at 19:04
  • 28
    No, the best solution isn't the one most readable in this case. Here the best algorithm is the fastest one. – NikiC Sep 23 '10 at 07:55
  • 26
    That's entirely your opinion, @nikic, although you're free to downvote me, obviously. There was no mention in the question as to how to quantify "best", the words "performance" or "fast" can be seen nowhere. That's why I opted for readable. – paxdiablo Sep 23 '10 at 08:57
  • @paxdiablo: Right as you are I changed my vote. I haven't read the question carefully enough. (I had to make a ghost-edit so I can change the vote. Hopefully you're okay with that.) – NikiC Sep 23 '10 at 10:34
  • 3
    I am reading this answer 3 years later, and I find it as the best answer because it is readable and has more comments. period. – waka-waka-waka Oct 30 '13 at 04:09
  • 3
    Four years later and this is the one I'd use. Mainly because I can understand it enough to replicate it. Good answer. – tom Oct 16 '14 at 11:42
  • 1
    Sorry but lookup tables are slow. – Sambatyon Feb 23 '15 at 13:45
  • 2
    @Sambatyon, I'd take your contention more seriously if you provided actual support for it. – paxdiablo Feb 23 '15 at 21:20
  • Much better than all the other answers. Clear, straightforward and comprehensible by humans without the aid of a debugger. – C.J. Jul 01 '16 at 13:17
  • in case of big lookup tables it might really be slower than runtime computation because of CPU cache miss, but these ones do not look big – Den Roman Dec 09 '16 at 16:14
  • 1
    Readability? `int cnt=0; for( int b=0; b < 32; b++) if( (val>>b) & 1) cnt++;` But I think this question isn't that interesting until speed is a priority. – greggo Mar 22 '17 at 15:42
  • 2
    Downvote. Readability is NOT an issue in unifunctional leaf code that needs no maintenance. Performance IS. – Aiken Drum Mar 01 '18 at 06:17
  • 2
    @AikenDrum, you're entitled to your opinion but mine is that pretty much *all* code eventually needs maintenance of some description. I pray I never have to maintain yours :-) – paxdiablo Mar 01 '18 at 08:54
  • If you had to read any of my code that was written for perf over readability, you would find it well-commented to compensate. If your LoB involves maximum throughput, you don't sacrifice the bottom line and a lot of customers' time to save a single future programmer time figuring out (or reading) how your algorithm counted the number of set bits in a value. Have you not worked in more than one level of code, or more than one job? No one should think that such an exceedingly general rule of thumb can be applied in all cases. Learn some decorum before you go insulting people's skills on SE. – Aiken Drum Mar 01 '18 at 10:53
  • I would suspect the bottleneck in the highly readable version would be the if branch, which would be unpredictable. Ponkadoodle's suggestion is almost as readable as the original, but removes the branch -- low hanging fruit that's still easy to read should be picked, IMHO. But anyway, I'm glad there's an answer that prioritises readability, even if that's not what I'd look for with a problem like this. – Jibb Smart Mar 01 '18 at 23:56
  • 1
    @AikenDrum, there is absolutely *nothing* in the question (or question history for that matter) that asks for performance. An earlier version asked for the best way (unwise since "best" was not defined) and my opinion has always been that readability trumps performance unless there are *specific* requirements. You should also learn to read smileys, that comment about maintaining your code was a gentle jab rather than an intended insult. My regret at doing so is tempered by the fact you appear to have retaliated in kind, implying a lack of experience on my part but I don't get offended by such. – paxdiablo Mar 02 '18 at 01:42
  • 4
    I made it quite clear that the answer was heavily based on my opinions and preferences, and I saw little need to rehash the performance-based answers already in play. Earlier comments have alredy covered this ground and, apparently, 172 people agreed though, admittedly, that's four fifths of bugger-all of the SO community so may not carry *that* much weight :-) Bottom line, I have no issue with the way you vote, I just wanted to make sure you et al at least understood *why* I gave the answer I gave. You may have the last word if you wish, I think I've explained as best I can. – paxdiablo Mar 02 '18 at 01:45
  • Note that the function bitCount() cannot be used for signed int: if the sign bit is set, the while loop never ends, because in some implementations the sign bit is shifted in instead of zeroes. – Fabian Jun 29 '18 at 21:35
  • @Fabian, hence the `unsigned` in the function prototype :-) – paxdiablo Jun 29 '18 at 22:41
  • i want to upvote this question SOLELY because it elicited the comment from @JohannesSchaub-litb but refraining since the passing neo-luddite would think i'm upvoting on merits. – user2297550 Jul 07 '18 at 07:44
  • 3
    even if i take pains to see your viewpoint, here is the problem: You state "the one that can be read by another programmer (or the original programmer two years later)" and then you write `while (value > 0) ..` and `if ((value & 1) == 1) ..` That implies that by "another programmer" you truly mean "another programmer of another language" or "original programmer who forgot C in the meanwhile" while tempting the casual reader to think you are advocating simpler algorithms. It is IDIOMATIC & *easier* to read `while (value) ..` and `if (value & 1) ..` Your true position is "code as would morons". – user2297550 Jul 07 '18 at 07:55
  • 3
    `unsigned int count_bits (unsigned int val) { unsigned int n; for (n = 0; val; val >>= 1) n += val & 1; return n; }` There .. was that so hard? And I stopped coding in C about 15 years ago, and it's easier to read. Going by your reputation the only rationale I can think of for your original answer is that you were trolling and establishing some kind of a moron honeypot to warn everybody else, lol. – user2297550 Jul 07 '18 at 08:03
  • ^^^ Overzealous programmers. – BathNinja Sep 07 '20 at 21:55
  • To me, `count += (value & 1); // add the low bit` is more readable than using an `if` around an increment. Or I guess you could say it's equally readable but a different conceptual algorithm: *add* all the bits together without their place-values, vs. counting occurrences. – Peter Cordes Aug 28 '23 at 19:04
  • I have to agree with @user2297550's comment, though, that `if (value & 1)` is more readable since there's less going on than `if ((value & 1) == 1)`. And especially the `value > 0` loop condition is non-obvious vs. `value != 0`. Maybe that was to match with your original `/= 2` division, to cast this as a math problem on arbitrary numbers without any specific base being special, rather than a bit-manipulation exercise using binary numbers in a language where integers are inherently already binary? But still, `!= 0` makes at least as much sense, and doesn't require thinking through `unsigned` – Peter Cordes Aug 28 '23 at 19:05
106

From Hacker's Delight, p. 66, Figure 5-2

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

Executes in roughly 20 instructions (architecture dependent), with no branching.

Hacker's Delight is delightful! Highly recommended.

Pete
  • 1,305
  • 1
  • 12
  • 36
Kevin Little
  • 12,436
  • 5
  • 39
  • 47
  • 8
    The Java method `Integer.bitCount(int)` uses this same exact implementation. – Marco Bolis Jan 05 '15 at 16:33
  • Having a little trouble following this - how would it change if we only cared about 16-bit values, instead of 32-bit? – Jeremy Blum Feb 24 '15 at 07:23
  • 1
    Maybe hackers delight is delightful, but I would give a good kicking to anybody calling this `pop` instead of `population_count` (or `pop_cnt` if you must have an abreviation). @MarcoBolis I presume that will be true of all versions of Java, but officially that would be implementation dependent :) – Maarten Bodewes Mar 18 '15 at 18:51
  • And, this requires no multiplications, like the code in the accepted answer. – Alex Jul 07 '17 at 13:23
  • Note that in generalizing to 64-bit there is a problem. The result cannot be 64, because of the mask. – Albert van der Horst Mar 05 '19 at 10:56
  • @Alex: The first part of this is the same bithack as the accepted answer, it simply uses multiple two shift+add steps to horizontally sum 4 bytes into 1, instead of one multiply by a `0x01010101` constant to sum all 4 bytes into the high byte. (I would have done the `x>>16` part first, to make the reduction work by narrowing the valuable part in half twice.) On a CPU without a fast multiplier, yes this is preferable (especially if it can do large shifts efficiently, not like 1 bit per cycle. Or if it has some other way to extract the high half; e.g. on a 16-bit CPU it would already be split – Peter Cordes Mar 13 '21 at 03:45
84

I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

It works because you can count the total number of set bits by dividing the number into two halves, counting the number of set bits in both halves and then adding them up. This is also known as the Divide and Conquer paradigm. Let's get into detail...

v = v - ((v >> 1) & 0x55555555); 

The count of set bits in a pair of bits can be 0b00, 0b01 or 0b10. Let's try to work this out on 2 bits...

 ---------------------------------------------
 |   v    |  x = (v >> 1) & 0b0101  |  v - x  |
 ---------------------------------------------
   0b00           0b00               0b00   
   0b01           0b00               0b01     
   0b10           0b01               0b01
   0b11           0b01               0b10

This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit value is >= 2 (0b10), then the AND produces 0b01; otherwise it produces 0b00.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333); 

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits, now we sum up that count in every 4 bits.

v & 0b00110011         // masks out the even-position two-bit groups
(v >> 2) & 0b00110011  // masks out the odd-position two-bit groups, shifted down

We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know, because of our previous operations, that every nibble holds the count of set bits within it. Let's look at an example. Suppose we have the byte 0b01000010: the first nibble holds the count 4 and the second holds the count 2. Now we add those nibbles together.

v = 0b01000010
(v >> 4) = 0b00000100
v + (v >> 4) = 0b01000010 + 0b00000100

This gives us the count of set bits per byte, held in the low nibble (0b01000110 here), and therefore we mask off the high four bits of every byte in the number (discarding them).

0b01000110 & 0x0F = 0b00000110

Now every byte has the count of set bits in it. We need to add them all up together. The trick is to multiply the result by 0x01010101, which has an interesting property. If our number has four bytes, A B C D, it will result in a new number with these bytes A+B+C+D, B+C+D, C+D, D. A 4-byte number can have a maximum of 32 bits set, which can be represented as 0b00100000, so the sum always fits in a byte.

All we need now is the first (most significant) byte, which has the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32-bit words but can be easily modified for 64-bit words.
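
A quick worked check of that multiply-and-shift reduction; the byte values below are hypothetical per-byte counts, not from the original answer:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0x02030104;        // per-byte counts A=2, B=3, C=1, D=4 (A is most significant)
    uint32_t s = x * 0x01010101u;   // = x + (x<<8) + (x<<16) + (x<<24) = 0x0A080504
    printf("%u\n", s >> 24);        // prints 10 = 2+3+1+4
    return 0;
}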

A1.exe
  • 331
  • 2
  • 10
vidit
  • 6,293
  • 3
  • 32
  • 50
  • 1
    What is the `c = ` about? Looks like is should be eliminated. Further, suggest an extra paren set A"(((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24" to avoid some classic warnings. – chux - Reinstate Monica Oct 15 '13 at 15:40
  • 4
    An important feature is that this 32-bit routine works for both `popcount(int v)` and `popcount(unsigned v)`. For portability, consider `popcount(uint32_t v)`, etc. Really like the *0x1010101 part. – chux - Reinstate Monica Oct 15 '13 at 15:49
  • sauce ? (book, link, invetors's names etc) would be VERY welcomed. Because then we can paste that in our codebases with a comment to where it comes from. – v.oddou Mar 31 '15 at 01:34
  • 2
    I think for better clarity the last line should be written as: `return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;` so we don't need to count letters to see what you are actually doing (since you discarded the first `0`, I accidentally thought you used the wrong (flipped) bit pattern as mask - that is until I noted there are only 7 letters and not 8). – emem Feb 06 '16 at 09:02
  • That **multiplication** by 0x01010101 might be slow, depending on processor. For example, in my old PowerBook G4, 1 multiplication was about as slow as 4 additions (not as bad as division, where 1 division was about as slow as 23 additions). – George Koehler Jan 03 '18 at 03:05
  • @GeorgeKoehler: Multiply usually has higher *latency* than addition, but for throughput it's often still a single slot in the pipeline. So out-of-order exec can still overlap this with the surrounding code. But the alternative for summing 4 bytes into 1 is two shifts and 2 adds, [as in the Hacker's Delight answer](https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer#comment117750365_109117) which is the same bithack as this and the accepted answer, up until the final reduction step. – Peter Cordes Mar 13 '21 at 03:50
  • This would be clearer if the `0x0F0F0F0F` was on a separate line, like `v = v + (v >> 4) & 0xF0F0F0F; // sum nibbles to bytes`. There's no benefit or clarity to jamming it onto the same line as the bytes -> int horizontal sum. It would also make the `c = ...` stray bit of code stand out more :P – Peter Cordes Mar 13 '21 at 03:52
63

If you happen to be using Java, the built-in method Integer.bitCount will do that.

Marco Bolis
  • 1,222
  • 2
  • 15
  • 31
Noether
  • 199
  • 1
  • 3
  • When sun provided different APIs, it must be using some logic on background, right? – Vallabh Patade Apr 12 '13 at 10:19
  • 2
    As a side note, Java's implementation uses the **same** algorithm pointed out by [Kevin Little](http://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer#109117). – Marco Bolis Jan 05 '15 at 16:37
  • 4
    Implementation aside, this is probably the clearest message of intent for developers maintaining your code after you (or when you come back to it 6 months later) – divillysausages May 04 '17 at 10:08
59

I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest is the following, at 3.7 seconds:

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}
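
The initializer above is shorthand; one way to populate the table at startup (a sketch, not part of the original timing code) uses the recurrence bits(i) = bits(i/2) + (i & 1):

static void init_wordbits(void)
{
    wordbits[0] = 0;
    for (unsigned i = 1; i < 65536; i++)
        wordbits[i] = (unsigned char)(wordbits[i / 2] + (i & 1));
}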

Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.

So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64 KB of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.

  • 54
    @Mike, The table based approach is unbeatable if the table is in the cache. This happens in micro-benchmarks (e.g. do millions of tests in a tight loop). However, a cache miss takes around 200 cycles, and even the most naive popcount will be faster here. It always depends on the application. – Nils Pipenbrinck Sep 25 '08 at 04:42
  • 10
    If you're not calling this routine a few million times in a tight loop then you have no reason to care about its performance at all, and might as well use the naive-but-readable approach since the performance loss will be negligible. And FWIW, the 8bit LUT gets cache-hot within 10-20 calls. –  Sep 25 '08 at 11:02
  • 8
    I don't think it's all that hard to imagine a situation where this is a leaf call made from the method -actually doing the heavy lifting- in your app. Depending on what else is going on (and threading) the smaller version could win. Lots of algorithms have been written that beat their peers due to better locality of reference. Why not this too? – Jason May 06 '10 at 08:50
  • Try this with clang, it's _significantly_ smarter at implementing builtins. – Matt Joiner Oct 05 '10 at 05:23
  • Fine with UInt32, but not feasible with UInt64. – HTTP 410 Nov 22 '11 at 12:11
  • 3
    GCC won't emit the popcnt instruction unless called with -msse4.2, in which case it's faster than 'sideways addition'. – lvella Jul 07 '12 at 06:02
  • What about looking up 1 word? – HelloGoodbye Apr 01 '16 at 09:30
  • 16-bit chunks are too big; a 64kiB table will miss a lot. 11-bit chunks let you do a 32-bit integer in 3 chunks and "only" 2048 bytes. That's still a lot, but small enough that parts of it may stay hot in cache, and/or are less likely to evict as much other valuable data that the rest of your program will use later. (right-shift by 11 and AND; the last chunks will definitely have its top bit cleared but that's fine.) – Peter Cordes Mar 13 '21 at 03:56
40
unsigned int count_bit(unsigned int x)
{
  x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
  x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
  x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
  return x;
}

Let me explain this algorithm.

This algorithm is based on the Divide and Conquer paradigm. Suppose there is an 8-bit integer 213 (11010101 in binary); the algorithm works like this (each step merges two neighboring blocks):

+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |  <- x
|  1 0  |  0 1  |  0 1  |  0 1  |  <- first time merge
|    0 0 1 1    |    0 0 1 0    |  <- second time merge
|        0 0 0 0 0 1 0 1        |  <- third time ( answer = 00000101 = 5)
+-------------------------------+
abcdabcd987
  • 2,043
  • 2
  • 23
  • 34
  • 8
    This algorithm is the version Matt Howells posted, before being optimized to the fact that it became unreadable. – Lefteris E Jun 13 '13 at 10:06
  • @LefterisE: Except using shift/add all the way to the end, instead of using a multiply to sum 8-bit chunks into the top 8 bits, replacing the last 2 `x=...` lines here. My edits demangled it some (keeping the optimized logic, improving readability and adding comments), and I added a section that explains the SWAR bithacks involved. This answer is still useful, though, at least for the diagram. Or for a hypothetical 32-bit machine with a slow multiply instruction. – Peter Cordes Jun 18 '22 at 02:49
30

Why not iteratively divide by 2?

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2  

I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity.
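
A direct C rendering of that loop might look like this (a sketch; with unsigned arithmetic the compiler is free to turn % 2 and /= 2 into the bit operations discussed in the comments below):

unsigned int count_set_bits(unsigned int n)
{
    unsigned int count = 0;
    while (n > 0) {
        if (n % 2 == 1)
            count += 1;
        n /= 2;
    }
    return count;
}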

flori
  • 14,339
  • 4
  • 56
  • 63
daniel
  • 2,568
  • 24
  • 32
  • That'll work and is easy to understand, but there are faster methods. – Matt Howells Sep 20 '08 at 19:40
  • 2
    Unless you do this a *LOT*, the performance impact would be negligible. So all things being equal, I agree with daniel that 'best' implies "doesn't read like gibberish". –  Sep 20 '08 at 21:50
  • 2
    I deliberately didn't define 'best', to get a variety of methods. Lets face it if we have got down to the level of this sort of bit-twiddling we are probably looking for something uber-fast that looks like a chimp has typed it. – Matt Howells Sep 21 '08 at 17:47
  • 6
    Bad code. A compiler might make good one out of it, but in my tests GCC did not. Replace (n%2) with (n&1); AND being much faster than MODULO. Replace (n/=2) with (n>>=1); bitshifting much faster than division. – Mecki Sep 25 '08 at 10:35
  • 6
    @Mecki: In my tests, gcc (4.0, -O3) *did* do the obvious optimisations. –  Sep 25 '08 at 13:32
  • 1
    Negative values of `n` always return 0. – chux - Reinstate Monica Oct 15 '13 at 15:22
  • Regarding optimization of %2 and /2: `n%2` and `n/2` are only equivalent to `n&1` and `n>>1` when `n >=0`. So, with `unsigned` it's fine (and the example is only really correct with `unsigned`, anyway). In this example, even with int, the compiler could prove that those ops are never reached unless `n>=0`, so it's not a reflection on the general case; if compiler can't prove `n>=0`, you will get something like `(n -(n>>31))>>1` for the n/2, and similar weirdness for the n%2. In general case, `if( (n%2)!=0)` could optimize to `(n&1)!=0` but `if( (n%2)==1)` cannot. – greggo Mar 28 '17 at 19:03
  • The 2 statements within the loop (namely, "if (n % 2) == 1" and "count += 1") could be replaced by a single, faster one: "count += (n % 2) == 1", leveraging the C++ support to implicit cast between bool and int. Interesting answer anyway. Upvoted. – Ronald Souza Sep 15 '19 at 06:52
  • `n%2 == 1` will be false for all negative `n`, in language like C where integer division truncates toward zero so `-3 % 2 == -1`. This might work in Python where integer division works differently; based on syntax using only indenting for nesting, it might well be Python, but that (and the reason it works) are not stated in the answer. Using division / remainder makes sense semantically when looking at collections of bits as numbers, but here we really do want to look at it as a collection of bits so shift and AND is most natural. Using arithmetic is an obfuscation. – Peter Cordes Sep 10 '22 at 01:32
30

This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3 using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, using rdtsc for timing (clock cycle precise).

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.

With 128 bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit test. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32); 
    return x & 0xFF;
}

That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That latter number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.

inline int pop4(unsigned long x, unsigned long y, 
                unsigned long u, unsigned long v)
{
  enum { m1 = 0x5555555555555555, 
         m2 = 0x3333333333333333, 
         m3 = 0x0F0F0F0F0F0F0F0F, 
         m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y; 
    u = u + v; 
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4; 
    x = x + (x >> 32);
    return x & 0x000001FF;
}

I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.

   hitime b4 = rdtsc(); 
   for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i) 
      sum += pop4 (i,  i^1, ~i, i|1); 
   hitime e4 = rdtsc(); 

256 billion bits summed in 8.17s elapsed. Works out to 1.02s for 32 million bits as benchmarked in the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. Came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20%. Now at 800 ms for 32 billion bits.

user183351
  • 161
  • 2
  • 3
  • 2
    The best non-assembler form like this I've seen unrolled 24 32bit words at a time. http://dalkescientific.com/writings/diary/popcnt.c, http://stackoverflow.com/questions/3693981/bit-popcount-for-large-buffer-assembly-preferred, http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html – Matt Joiner Oct 05 '10 at 05:25
27

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.

unsigned int bitCount(unsigned int x)
{
  x = ((x >> 1) & 0b01010101010101010101010101010101)
     + (x       & 0b01010101010101010101010101010101);
  x = ((x >> 2) & 0b00110011001100110011001100110011)
     + (x       & 0b00110011001100110011001100110011); 
  x = ((x >> 4) & 0b00001111000011110000111100001111)
     + (x       & 0b00001111000011110000111100001111); 
  x = ((x >> 8) & 0b00000000111111110000000011111111)
     + (x       & 0b00000000111111110000000011111111); 
  x = ((x >> 16)& 0b00000000000000001111111111111111)
     + (x       & 0b00000000000000001111111111111111); 
  return x;
}

The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.

0xF
  • 3,214
  • 1
  • 25
  • 29
John Dimm
  • 91
  • 2
  • 5
  • 3
    This solution seem to have minor problem, related to operator precedence. For each term it should say: x = (((x >> 1) & 0b01010101010101010101010101010101) + (x & 0b01010101010101010101010101010101)); (i.e. extra parens added). – Nopik Aug 22 '14 at 07:38
  • 1
    In case you're confused, the error in the original article that @Nopik pointed out has since been fixed (by someone else), and without newly introducing *extraneous* parentheses as the comment suggests. – Glenn Slayden Jan 18 '22 at 03:33
23

For a happy medium between a 2^32 lookup table and iterating through each bit individually:

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

From http://ctips.pbwiki.com/CountBits

PhirePhly
  • 2,908
  • 1
  • 17
  • 12
  • Not portable. What if the CPU has 9 bit bytes? Yes, there are real CPU's like that out there... – Robert S. Barnes Mar 29 '11 at 08:06
  • 18
    @Robert S. Barnes, this function will still work. It makes no assumption about native word size, and no reference to "bytes" at all. – finnw May 08 '11 at 11:26
  • Is the complexity of this code ``O(floor(log2(num))/4)``, assuming ``num`` can be as arbitrarily large as possible? Because the ``while`` loop runs as long as there's a nibble to process? There are ``floor(log2(num))`` bits and ``floor(log2(num)) / 4`` nibbles. Is the reasoning correct? – Robur_131 Jun 12 '20 at 19:28
  • @Robur_131 I don't see anything wrong with your reasoning, except that big-O doesn't care about constant factors so you could simplify to just O(log n). The nice thing about this algorithm is it doesn't always take worst case, if the upper bits are zero it exits early. In fact for an input of zero the loop doesn't run at all. – Mark Ransom May 16 '21 at 15:55
  • @MarkRansom: So unless you're optimizing for inputs of zero, you should probably change it to a `do{}while` loop that doesn't compare/branch until after checking the low 4 bits. Or unroll to check 2x 4 bits on the first iteration. That means the branch-prediction pattern is the same for all inputs from 0..255, and making the branching more predictable is often a Good Thing, when the work saved is much cheaper than a branch mispredict. (Of course that depends on the CPU, and you wouldn't typically use this on a high-end CPU where branch misses are most expensive.) – Peter Cordes Jun 18 '22 at 02:55
19

This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}
herohuyongtao
  • 49,413
  • 29
  • 133
  • 174
  • 1
    This is essentially **Brian Kernighan's** (remember him?) algorithm, with the minor change that he used the more succinct `n &= (n-1)` form. – Adrian Mole Aug 23 '19 at 13:05
17

It's not the fastest or best solution, but I found the same question on my way, and I started to think and think. Finally I realized that if you approach the problem from the mathematical side and draw a graph, you find that it's a function with a periodic part, and then you notice the difference between the periods... so here you go:

unsigned int f(unsigned int x)
{
    switch (x) {
        case 0:
            return 0;
        case 1:
            return 1;
        case 2:
            return 1;
        case 3:
            return 2;
        default:
            return f(x/4) + f(x%4);
    }
}
Andro Selva
  • 53,910
  • 52
  • 193
  • 240
Peter
  • 71
  • 1
  • 2
  • 4
    oh i like that. how bout the python version: `def f(i, d={0:lambda:0, 1:lambda:1, 2:lambda:1, 3:lambda:2}): return d.get(i, lambda: f(i//4) + f(i%4))()` – underrun Feb 01 '13 at 19:04
11

I think Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

int countSetBits(unsigned int n) { 
    unsigned int c; // c accumulates the total bits set in n
    for (c = 0; n > 0; n = n & (n - 1))
        c++; 
    return c; 
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

jan.supol
  • 2,636
  • 1
  • 26
  • 31
Erorr
  • 1
  • 1
  • 4
10

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp. 11-12 (although there was a brief reference in Volume 2, 4.6.3-(7)).

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.
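
A minimal C sketch of the two approaches (my reconstruction of the idea, assuming a 32-bit unsigned int; not Wegner's original code): the sparse version clears one set bit per iteration, while the dense version applies the same trick to the complement and subtracts.

int count_sparse(unsigned int v) {
    int c = 0;
    for (; v; ++c)
        v &= v - 1;   /* clear the lowest set bit */
    return c;
}

int count_dense(unsigned int v) {
    int zeros = 0;
    for (v = ~v; v; ++zeros)
        v &= v - 1;   /* each pass accounts for one zero bit of the original */
    return 32 - zeros;
}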

Richard
  • 56,349
  • 34
  • 180
  • 251
Michael Dorfman
  • 4,060
  • 1
  • 22
  • 29
10
private int get_bits_set(int v)
{
    int c; // 'c' accumulates the total bits set in 'v'
    for (c = 0; v != 0; c++)
    {
        v &= v - 1; // Clear the least significant bit set
    }
    return c;
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
stacktay
  • 81
  • 2
  • 3
9

A few open questions:

  1. What if the number is negative?
  2. If the number is 1024, then the "iteratively divide by 2" method will iterate 10 times.

We can modify the algorithm to support negative numbers as follows:

count = 0
while n != 0
    if (n % 2) == 1 || (n % 2) == -1
        count += 1
    n /= 2
return count

Now, to overcome the second problem, we can write the algorithm like this:

int bit_count(int num)
{
    int count=0;
    while(num)
    {
        num=(num)&(num-1);
        count++;
    }
    return count;
}

For a complete reference, see:

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

osgx
  • 90,338
  • 53
  • 357
  • 513
Baban
  • 11
  • 1
  • 1
8

I use the code below, which I find more intuitive.

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

Logic: n & (n-1) resets the last set bit of n.

P.S.: I know this is not an O(1) solution, but it is an interesting one.

Manish Mulani
  • 7,125
  • 9
  • 43
  • 45
  • this is good for "sparse" numbers with a low number of bits, as it is `O(ONE-BITS)`. It is indeed O(1) since there are at most 32 one-bits. – ealfonso Jan 14 '18 at 19:55
7

If you're using C++, another option is to use template metaprogramming:

// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
        // return the least significant bit plus the result of calling ourselves with
        // .. the shifted value
        return (val & 0x1) + countBits<BITS-1>(val >> 1);
}

// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
        return val & 0x1;
}

usage would be:

// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )

// another byte (this returns 7)
countBits<8>( 254 )

// counting bits in a word/short (this returns 1)
countBits<16>( 256 )

You could, of course, further expand this template to use different types (even auto-detecting the bit size), but I've kept it simple for clarity.

Edit: forgot to mention that this is good because it should work in any C++ compiler, and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find).

pentaphobe
  • 337
  • 1
  • 4
  • 9
  • Unfortunately, the bit counting isn't done in parallel, so it's probably slower. Might make a nice `constexpr` though. – geometrian Jul 03 '15 at 15:14
  • Agreed - it was a fun exercise in C++ template recursion, but definitely a pretty naïve solution. – pentaphobe Sep 25 '15 at 06:46
7

What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and it has a constant execution time. The code is also very short.

But if speed is the major factor, and not the code size, then I think the following can be faster:

static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };

static int bitCountOfByte( int value ){
    return BIT_COUNT[ value & 0xFF ];
}

static int bitCountOfInt( int value ){
    return bitCountOfByte( value )
         + bitCountOfByte( value >> 8 )
         + bitCountOfByte( value >> 16 )
         + bitCountOfByte( value >> 24 );
}

I don't think this will be faster for a 64-bit value, but a 32-bit value can be faster.

Ahmed
  • 7,148
  • 12
  • 57
  • 96
Horcrux7
  • 23,758
  • 21
  • 98
  • 156
  • My code has 10 operations. Your code has 12 operations. Your link works with smaller arrays (5). I use 256 elements. With caching there can be a problem. But if you use it very frequently then this is not a problem. – Horcrux7 Sep 20 '08 at 21:12
  • This approach is measurably quite a bit faster than the bit-twiddling approach, as it turns out. As for using more memory, it compiles to less code and that gain is repeated every time you inline the function. So it could easily turn out to be a net win. –  Sep 25 '08 at 03:43
7

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), or branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same number of cycles). It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lower bound.

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is the secret to the first and most complex step:

input output
AB    CD             Note
00    00             = AB
01    01             = AB
10    01             = AB - (A >> 1) & 0x1
11    10             = AB - (A >> 1) & 0x1

So if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.
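
For completeness, here is that 8-row table for the 3-bit extension (my own check, not part of the original answer): each 3-bit group n is replaced by n - (n >> 1) - (n >> 2), which equals its bit count.

input output
ABC   DEF
000   000
001   001
010   001
011   010
100   001
101   010
110   010
111   011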

  • Don Gillies
systemBuilder
  • 55
  • 1
  • 3
7

You can do:

int count = 0;
while(n){
    n = n & (n-1);
    count++;
}

The logic behind this is that the bits of n-1 are inverted, starting from the rightmost set bit of n.

If n = 6, i.e., 110, then 5 is 101; the bits are inverted starting from the rightmost set bit of n.

So if we AND these two, we make the rightmost set bit 0 in every iteration, and always move on to the next rightmost set bit. Hence, we count the set bits. The worst-case time complexity is O(log n), when every bit is set.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Varun Gusain
  • 1
  • 1
  • 2
7

C++20 std::popcount

The following proposal has been merged: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html and it should add popcount to the <bit> header.

I expect the usage to be like:

#include <bit>
#include <iostream>

int main() {
    std::cout << std::popcount(0x55) << std::endl;
}

I'll give it a try when support arrives in GCC; GCC 9.1.0 with g++-9 -std=c++2a still doesn't support it.

The proposal says:

Header: <bit>

namespace std {

  // 25.5.6, counting
  template<class T>
    constexpr int popcount(T x) noexcept;

and:

template<class T>
  constexpr int popcount(T x) noexcept;

Constraints: T is an unsigned integer type (3.9.1 [basic.fundamental]).

Returns: The number of 1 bits in the value of x.

std::rotl and std::rotr were also added to do circular bit rotations: Best practices for circular shift (rotate) operations in C++

Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
6

I'm particularly fond of this example from the fortune file:

#define BITCOUNT(x)    (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x)         ((x) - (((x)>>1)&0x77777777)  \
                             - (((x)>>2)&0x33333333) \
                             - (((x)>>3)&0x11111111))

I like it best because it's so pretty!

6

A fast C# solution using a pre-calculated table of Byte bit counts with branching on the input size.

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS =
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
dadhi
  • 4,807
  • 19
  • 25
  • Ironically, that table could have been created by any of the algorithms posted in this thread! Nevertheless, using tables like this means constant-time performance. Going one step further and creating a 64K translation table would therefore halve the AND, SHIFT and ADD operations necessary. An interesting subject for bit manipulators! – KRK Owner Jan 10 '16 at 23:23
  • Bigger tables can be slower (and not constant-time) due to cache issues. You can 'look up' 3 bits at a time with `(0xe994 >>(k*2))&3`,without memory access... – greggo Mar 28 '17 at 18:42
6

I found an implementation of bit counting over an array, using SIMD instructions (SSSE3 and AVX2). It has 2-2.5 times better performance than using the __popcnt64 intrinsic function.

SSSE3 version:

#include <tmmintrin.h> // SSSE3 (_mm_shuffle_epi8)
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum = _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum =  _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}
ErmIg
  • 3,980
  • 1
  • 27
  • 40
6

I always use this in competitive programming. It's easy to write and efficient:

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
iamdeit
  • 5,485
  • 4
  • 26
  • 40
6

Java JDK1.5

Integer.bitCount(n);

where n is the number whose 1's are to be counted.

Check also:

Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);

// Beginning with the value 1, rotate left 16 times
n = 1;
for (int i = 0; i < 16; i++) {
    n = Integer.rotateLeft(n, 1);
    System.out.println(n);
}
zengr
  • 38,346
  • 37
  • 130
  • 192
Rahul
  • 41
  • 1
  • 1
5

Here is a portable module ( ANSI-C ) which can benchmark each of your algorithms on any architecture.

Your CPU has 9-bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte-wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure out a way to make the "Hacker's Delight" algorithm portable, feel free to add it in.
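
As a rough starting point for that, here is a sketch (my addition, not part of the module below) with the parallel-counting masks derived from ~0u rather than hard-coded hex, so they follow the width of unsigned int. It still assumes that width is a multiple of 8, so a true 9-bit-byte machine would need more work:

#include <limits.h>

static int parallelBitCount( unsigned int v ) {
    v = v - ( ( v >> 1 ) & ( ~0u / 3 ) );                              /* pairs   */
    v = ( v & ( ~0u / 15 * 3 ) ) + ( ( v >> 2 ) & ( ~0u / 15 * 3 ) ); /* nibbles */
    v = ( v + ( v >> 4 ) ) & ( ~0u / 255 * 15 );                      /* bytes   */
    return ( int )( ( v * ( ~0u / 255 ) ) >>                          /* sum     */
                    ( ( sizeof( unsigned int ) - 1 ) * CHAR_BIT ) );
}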

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Weight of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.  
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTable: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder. 
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String representations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

.

#include <limits.h>

#include "bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
 * values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
    int count;

    /* Starting with:
     * 1100 - 1 == 1011,  1100 & 1011 == 1000
     * 1000 - 1 == 0111,  1000 & 0111 == 0000
     */
    for ( count = 0; val; ++count )
        val &= val - 1;

    return count;
}

/* Looks up each byte of the integer in a lookup table.
 *
 * The first time the function is called it initializes the lookup table.
 */
static int _tableBitCount( unsigned int val ) {
    int bCount = 0;

    if ( !_lookupTableInitialized ) {
        unsigned int i;
        for ( i = 0; i != UCHAR_MAX + 1; ++i )
            _bitCountTable[i] =
                ( unsigned char )_defaultBitCount( i );

        _lookupTableInitialized = 1;
    }

    for ( ; val; val >>= CHAR_BIT )
        bCount += _bitCountTable[val & UCHAR_MAX];

    return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = { "onTheFly", "lookupTable" };

void setStrategy( enum strategy s ) {
    switch ( s ) {
    case onTheFly:
        _bitcount = _defaultBitCount;
        break;
    case lookupTable:
        _bitcount = _tableBitCount;
        break;
    case strategyCount:
        break;
    }
}

/* Just a forwarding function which will call whichever version of the
 * algorithm has been selected by the client 
 */
int bitcount( unsigned int val ) {
    return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo-random numbers to benchmark each Hamming
 * Weight algorithm.
 */
void benchmark( int reps ) {
    clock_t start, stop;
    int i, j;
    static const int iterations = 1000000;

    for ( j = 0; j != strategyCount; ++j ) {
        setStrategy( j );

        srand( 257 );

        start = clock(  );

        for ( i = 0; i != reps * iterations; ++i )
            bitcount( rand(  ) );

        stop = clock(  );

        printf
            ( "\n\t%d psudoe-random integers using %s: %f seconds\n\n",
              reps * iterations, strategyNames[j],
              ( double )( stop - start ) / CLOCKS_PER_SEC );
    }
}

int main( void ) {
    int option;

    while ( 1 ) {
        printf( "Menu Options\n"
            "\t1.\tPrint the Hamming Weight of an Integer\n"
            "\t2.\tBenchmark Hamming Weight implementations\n"
            "\t3.\tExit ( or cntl-d )\n\n\t" );

        if ( scanf( "%d", &option ) == EOF )
            break;

        switch ( option ) {
        case 1:
            printf( "Please enter the integer: " );
            if ( scanf( "%d", &option ) != EOF )
                printf
                    ( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                      option, option, bitcount( option ) );
            break;
        case 2:
            printf
                ( "Please select number of reps ( in millions ): " );
            if ( scanf( "%d", &option ) != EOF )
                benchmark( option );
            break;
        case 3:
            goto EXIT;
            break;
        default:
            printf( "Invalid option\n" );
        }

    }

 EXIT:
    printf( "\n" );

    return 0;
}

#endif
Alex North-Keys
  • 4,200
  • 1
  • 20
  • 22
Robert S. Barnes
  • 39,711
  • 30
  • 131
  • 179
  • 1
    I like very much your plug-in, polymorphic approach, as well as the switch to build as a reusable library or stand-alone, test executable. Very well thought =) –  Oct 10 '12 at 16:12
5

There are many algorithms to count the set bits, but I think the best one is the fastest one! You can see the details on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) 
     % 0x1f;

// option 3, for at most 32-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 
     0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.

Mostafa
  • 670
  • 1
  • 10
  • 20
4

Naive Solution

Time Complexity is O(no. of bits in n)

int countSet(unsigned int n)
{
    int res=0;
    while(n!=0){
      res += (n&1);
      n >>= 1;      // logical right shift, like C unsigned or Java >>>
    }
   return res;
}

Brian Kernighan's algorithm

Time Complexity is O(no of set bits in n)

int countSet(unsigned int n)
{
  int res=0;
  while(n != 0)
  {
    n = (n & (n-1));
    res++;
  }
  return res;
} 

Lookup table method for a 32-bit number - in this method, we break the 32-bit number into four chunks of 8-bit numbers

Time Complexity is O(1)

static unsigned char table[256]; /* the table size is 256,
                        the number of values i&0xFF (8 bits) can have */

void initialize() //holds the number of set bits from 0 to 255
{
  table[0]=0;
  for(unsigned int i=1;i<256;i++)
     table[i]=(i&1)+table[i>>1];
}

int countSet(unsigned int n)
{
  // 0xff is hexadecimal representation of 8 set bits.
  int res=table[n & 0xff];
  n=n>>8;
  res=res+ table[n & 0xff];
  n=n>>8;
  res=res+ table[n & 0xff];
  n=n>>8;
  res=res+ table[n & 0xff];
  return res;
}
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
4

32-bit or not? I came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment count, then right-shift the integer.

public static int bitCount( int n){
    int count = 0;
    // use the logical shift >>> so a negative input can't loop forever
    for (int i=n; i!=0; i = i >>> 1){
        count += i & 1;
    }
    return count;
}

I think this one is more intuitive than the solutions with the constant 0x33333333, no matter how fast they are. It depends on your definition of "best algorithm".

sarnold
  • 102,305
  • 22
  • 181
  • 238
Raymond Chenon
  • 11,482
  • 15
  • 77
  • 110
3

Python solution:

def hammingWeight(n: int) -> int:
    sums = 0
    while (n!=0):
        sums+=1
        n = n &(n-1)

    return sums

In the binary representation, the least significant 1-bit in n always corresponds to a 0-bit in n - 1. Therefore, ANDing the two numbers n and n-1 always flips the least significant 1-bit in n to 0 and keeps all other bits the same.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Jerry An
  • 1,077
  • 10
  • 18
2

Personally, I use this:

  public static int myBitCount(long L){
      int count = 0;
      while (L != 0) {
         count++;
         L ^= L & -L; 
      }
      return count;
  }
SteveR
  • 2,496
  • 1
  • 24
  • 23
  • I like this method. It's not that hard to understand, and there is only one line that really needs any comment. And when you understand that one line, it is like a little gem that you have been given. (What was in my coffee this morning?) I have used this without the loop to check if only a single flag bit is set - i.e. process one way if only one flag bit is set, some other (more tedious) way if multiple flags are set. (edit: formatting) – David Jan 19 '16 at 22:47
2
int countBits(int x)
{
    int n = 0;
    if (x) do n++;
           while(x=x&(x-1));
    return n;
}   

Or also:

int countBits(int x) { return (x)? 1+countBits(x&(x-1)): 0; }

7 1/2 years after my original answer, @PeterMortensen questioned if this was even valid C syntax. I posted a link to an online compiler showing that it is in fact perfectly valid syntax (code below).

#include <stdio.h>
int countBits(int x)
{
    int n = 0;
    if (x) do n++;           /* Totally Normal Valid code. */
           while(x=x&(x-1)); /* Nothing to see here.       */
    return n;
}   
 
int main(void) {
    printf("%d\n", countBits(25));
    return 0;
}
 

Output:

3

If you want to re-write it for clarity, it would look like:

if (x)
{
    do
    {
        n++;
    } while(x=x&(x-1));
}

But that seems excessive to my eye.

However, I've also realized the function can be made shorter, but perhaps more cryptic, written as:

int countBits(int x)
{
    int n = 0;
    while (x) x=(n++,x&(x-1));
    return n;
}   
abelenky
  • 63,815
  • 23
  • 109
  • 159
1

Here is a solution that has not been mentioned so far, using bitfields. The following program counts the set bits in an array of 100000000 16-bit integers using 4 different methods. Timing results are given in parentheses (on MacOSX, with gcc -O3):

#include <stdio.h>
#include <stdlib.h>

#define LENGTH 100000000

typedef struct {
    unsigned char bit0 : 1;
    unsigned char bit1 : 1;
    unsigned char bit2 : 1;
    unsigned char bit3 : 1;
    unsigned char bit4 : 1;
    unsigned char bit5 : 1;
    unsigned char bit6 : 1;
    unsigned char bit7 : 1;
} bits;

unsigned char sum_bits(const unsigned char x) {
    const bits *b = (const bits*) &x;
    return b->bit0 + b->bit1 + b->bit2 + b->bit3 \
         + b->bit4 + b->bit5 + b->bit6 + b->bit7;
}

int NumberOfSetBits(int i) {
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

#define out(s) \
    printf("bits set: %lu\nbits counted: %lu\n", 8*LENGTH*sizeof(short)*3/4, s);

int main(int argc, char **argv) {
    unsigned long i, s;
    unsigned short *x = malloc(LENGTH*sizeof(short));
    unsigned char lut[65536], *p;
    unsigned short *ps;
    int *pi;

    /* set 3/4 of the bits */
    for (i=0; i<LENGTH; ++i)
        x[i] = 0xFFF0;

    /* sum_bits (1.772s) */
    for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=sum_bits(*p++));
    out(s);

    /* NumberOfSetBits (0.404s) */
    for (i=LENGTH*sizeof(short)/sizeof(int), pi=(int*)x, s=0; i--; s+=NumberOfSetBits(*pi++));
    out(s);

    /* populate lookup table */
    for (i=0, p=(unsigned char*) &i; i<sizeof(lut); ++i)
        lut[i] = sum_bits(p[0]) + sum_bits(p[1]);

    /* 256-bytes lookup table (0.317s) */
    for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=lut[*p++]);
    out(s);

    /* 65536-bytes lookup table (0.250s) */
    for (i=LENGTH, ps=x, s=0; i--; s+=lut[*ps++]);
    out(s);

    free(x);
    return 0;
}

While the bitfield version is very readable, the timing results show that it is over 4x slower than NumberOfSetBits(). The lookup-table based implementations are still quite a bit faster, in particular with a 65 kB table.

Stefan
  • 4,380
  • 2
  • 30
  • 33
  • 1
    Note that this is a microbenchmark and should be taken with a grain of salt. For example, setting up the lookup table just before using it is priming the cache in a way that may be rare in real code. – Kevin Cox Feb 10 '14 at 02:21
  • Re "4x slower ... quite a bit faster": That needs to be qualified. For example, is it true for [an AVR microcontroller](https://www.microchip.com/wwwproducts/en/ATmega32U4)? What kind of system of presumed? An [AMD Ryzen 3950X](https://en.wikipedia.org/wiki/Table_of_AMD_processors)? – Peter Mortensen Aug 26 '22 at 20:31
1
int bitcount(unsigned int n)
{ 
      int count=0;
      while(n)
      {
           count += n & 0x1u;
           n >>= 1;
      }
      return  count;
 }

The iterated 'count' runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. It is useful if the 1s (the set bits) are sparse and among the least significant bits.

Mufaddal Kagda
  • 160
  • 1
  • 10
1

Another Hamming weight algorithm if you're on a BMI2 capable CPU:

the_weight = __tzcnt_u64(~_pext_u64(data[i], data[i]));
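
To make that one-liner self-contained, here's a sketch of how it might be wrapped (an assumption on my part: x86-64 with BMI1/BMI2, compiled with gcc/clang and -mbmi -mbmi2). _pext_u64(x, x) packs the set bits of x into the low end, so the complement has exactly popcount(x) trailing zeros.

#include <immintrin.h>
#include <stdio.h>

/* Requires BMI1 (tzcnt) and BMI2 (pext); build with -mbmi -mbmi2. */
static unsigned popcnt_bmi2(unsigned long long x)
{
    /* pext(x, x) -> a block of popcount(x) low ones; invert and
     * count trailing zeros to recover the population count. */
    return (unsigned)__tzcnt_u64(~_pext_u64(x, x));
}

int main(void)
{
    printf("%u\n", popcnt_bmi2(0x55)); /* prints 4 */
    return 0;
}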
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Anders Cedronius
  • 2,036
  • 1
  • 23
  • 29
  • (cram set bits to the low end, invert, count unset bits from the low end) – greybeard Nov 18 '15 at 23:07
  • Fun, but of no practical value. All BMI2 CPUs have `popcnt`. `pext same,same` to pack the bits could be an interesting building-block for something else, but `tzcnt` and `pext` both run on the same port as `popcnt` on Intel CPUs, and `pext` is very slow on AMD. (http://agner.org/optimize/). You can sort of emulate `pext x,x` with `(1ULL << popcnt(x)) - 1`, except for the x==0 case. x86 shifts can't shift out *all* the bits, because they mask the shift count, and you have to watch out for C undefined behaviour with out of range counts. – Peter Cordes Sep 22 '17 at 06:42
  • What compiler was it tried with? It seems to be compiler-specific (compiler extension __tzcnt_u64? Library function '__tzcnt_u64')? Can you add some context to the answer? – Peter Mortensen Aug 26 '22 at 21:05
  • Sure. for MacOS and Linux #include for Visual Studio use #include . Modern compilers should not need any specific linker flags. However let me know if you're stuck and I'll help you out. – Anders Cedronius Aug 30 '22 at 09:23
1

In Java 8 or 9, just invoke Integer.bitCount.

Jonatan Kaźmierczak
  • 1,821
  • 2
  • 15
  • 9
1

You can use the built-in function named __builtin_popcount(). There is no __builtin_popcount in standard C++, but it is a built-in function of the GCC compiler. This function returns the number of set bits in an integer.

int __builtin_popcount (unsigned int x);

Reference : Bit Twiddling Hacks

rashedcs
  • 3,588
  • 2
  • 39
  • 40
1

From Python 3.10 onwards, you will be able to use the int.bit_count() function, but for the time being, you can define this function yourself.

def bit_count(integer):
    return bin(integer).count("1")
Boštjan Mejak
  • 827
  • 9
  • 23
1

I am providing one more unmentioned algorithm, called Parallel, taken from here. The nice point about it is that it is generic, meaning that the code is the same for bit sizes of 8, 16, 32, 64, and 128.

I checked the correctness of its values and its timings on an amount of 2^26 numbers, for bit sizes 8, 16, 32, and 64. See the timings below.

This algorithm is the first code snippet. The other two are mentioned here just for reference, because I tested and compared against them.

The algorithms are coded in C++, to be generic, but they can be easily adapted to old C.

#include <type_traits>
#include <cstdint>

template <typename IntT>
inline size_t PopCntParallel(IntT n) {
    // https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
    using T = std::make_unsigned_t<IntT>;

    T v = T(n);
    v = v - ((v >> 1) & (T)~(T)0/3);                           // temp
    v = (v & (T)~(T)0/15*3) + ((v >> 2) & (T)~(T)0/15*3);      // temp
    v = (v + (v >> 4)) & (T)~(T)0/255*15;                      // temp
    return size_t((T)(v * ((T)~(T)0/255)) >> (sizeof(T) - 1) * 8); // count
}

Below are two algorithms that I compared with. One is the Kernighan simple method with a loop, taken from here.

template <typename IntT>
inline size_t PopCntKernighan(IntT n) {
    // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
    using T = std::make_unsigned_t<IntT>;
    T v = T(n);
    size_t c;
    for (c = 0; v; ++c)
        v &= v - 1; // Clear the least significant bit set
    return c;
}

Another one uses the built-in __popcnt16()/__popcnt()/__popcnt64() MSVC intrinsics (doc here), or __builtin_popcount of Clang/GCC (doc here). These intrinsics should provide a very optimized version, possibly hardware:

#ifdef _MSC_VER
    // https://learn.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=msvc-160
    #include <intrin.h>
    #define popcnt16 __popcnt16
    #define popcnt32 __popcnt
    #define popcnt64 __popcnt64
#else
    // https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
    #define popcnt16 __builtin_popcount
    #define popcnt32 __builtin_popcount
    #define popcnt64 __builtin_popcountll
#endif

template <typename IntT>
inline size_t PopCntBuiltin(IntT n) {
    using T = std::make_unsigned_t<IntT>;
    T v = T(n);
    if constexpr(sizeof(IntT) <= 2)
        return popcnt16(uint16_t(v));
    else if constexpr(sizeof(IntT) <= 4)
        return popcnt32(uint32_t(v));
    else if constexpr(sizeof(IntT) <= 8)
        return popcnt64(uint64_t(v));
    else
        static_assert([]{ return false; }());
}

Below are the timings, in nanoseconds per number. All timings are done for 2^26 random numbers. Timings are compared for all three algorithms and all bit sizes among 8, 16, 32, and 64. In sum, all tests took 16 seconds on my machine. The high-resolution clock was used.

08 bit Builtin 8.2 ns
08 bit Parallel 8.2 ns
08 bit Kernighan 26.7 ns

16 bit Builtin 7.7 ns
16 bit Parallel 7.7 ns
16 bit Kernighan 39.7 ns

32 bit Builtin 7.0 ns
32 bit Parallel 7.0 ns
32 bit Kernighan 47.9 ns

64 bit Builtin 7.5 ns
64 bit Parallel 7.5 ns
64 bit Kernighan 59.4 ns

128 bit Builtin 7.8 ns
128 bit Parallel 13.8 ns
128 bit Kernighan 127.6 ns

As one can see, the provided Parallel algorithm (first among the three) is as good as MSVC's/Clang's intrinsic.


For reference, below is full code that I used to test speed/time/correctness of all functions.

As a bonus this code (unlike short code snippets above) also tests 128 bit size, but only under CLang/GCC (not MSVC), as they have unsigned __int128.

Try it online!

#include <type_traits>
#include <cstdint>

using std::size_t;

#if defined(_MSC_VER) && !defined(__clang__)
    #define IS_MSVC 1
#else
    #define IS_MSVC 0
#endif

#if IS_MSVC
    #define HAS128 false
#else
    using int128_t = __int128;
    using uint128_t = unsigned __int128;
    #define HAS128 true
#endif

template <typename T> struct UnSignedT { using type = std::make_unsigned_t<T>; };
#if HAS128
    template <> struct UnSignedT<int128_t> { using type = uint128_t; };
    template <> struct UnSignedT<uint128_t> { using type = uint128_t; };
#endif
template <typename T> using UnSigned = typename UnSignedT<T>::type;

template <typename IntT>
inline size_t PopCntParallel(IntT n) {
    // https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
    using T = UnSigned<IntT>;

    T v = T(n);
    v = v - ((v >> 1) & (T)~(T)0/3);                           // temp
    v = (v & (T)~(T)0/15*3) + ((v >> 2) & (T)~(T)0/15*3);      // temp
    v = (v + (v >> 4)) & (T)~(T)0/255*15;                      // temp
    return size_t((T)(v * ((T)~(T)0/255)) >> (sizeof(T) - 1) * 8); // count
}

template <typename IntT>
inline size_t PopCntKernighan(IntT n) {
    // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
    using T = UnSigned<IntT>;
    T v = T(n);
    size_t c;
    for (c = 0; v; ++c)
        v &= v - 1; // Clear the least significant bit set
    return c;
}

#if IS_MSVC
    // https://learn.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=msvc-160
    #include <intrin.h>
    #define popcnt16 __popcnt16
    #define popcnt32 __popcnt
    #define popcnt64 __popcnt64
#else
    // https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
    #define popcnt16 __builtin_popcount
    #define popcnt32 __builtin_popcount
    #define popcnt64 __builtin_popcountll
#endif

#define popcnt128(x) (popcnt64(uint64_t(x)) + popcnt64(uint64_t(x >> 64)))

template <typename IntT>
inline size_t PopCntBuiltin(IntT n) {
    using T = UnSigned<IntT>;
    T v = T(n);
    if constexpr(sizeof(IntT) <= 2)
        return popcnt16(uint16_t(v));
    else if constexpr(sizeof(IntT) <= 4)
        return popcnt32(uint32_t(v));
    else if constexpr(sizeof(IntT) <= 8)
        return popcnt64(uint64_t(v));
    else if constexpr(sizeof(IntT) <= 16)
        return popcnt128(uint128_t(v));
    else
        static_assert([]{ return false; }());
}

#include <random>
#include <vector>
#include <chrono>
#include <string>
#include <iostream>
#include <iomanip>
#include <map>

inline double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

template <typename T, typename F>
void Test(std::string const & name, F f) {
    std::mt19937_64 rng{123};
    size_t constexpr bit_size = sizeof(T) * 8, ntests = 1 << 6, nnums = 1 << 14;
    std::vector<T> nums(nnums);
    for (size_t i = 0; i < nnums; ++i)
        nums[i] = T(rng() % ~T(0));
    static std::map<size_t, size_t> times;
    double min_time = 1000;
    for (size_t i = 0; i < ntests; ++i) {
        double timer = Time();
        size_t sum = 0;
        for (size_t j = 0; j < nnums; j += 4)
            sum += f(nums[j + 0]) + f(nums[j + 1]) + f(nums[j + 2]) + f(nums[j + 3]);
        auto volatile vsum = sum;
        min_time = std::min(min_time, (Time() - timer) / nnums);
        if (times.count(bit_size) && times.at(bit_size) != sum)
            std::cout << "Wrong bit cnt checksum!" << std::endl;
        times[bit_size] = sum;
    }
    std::cout << std::setw(2) << std::setfill('0') << bit_size
        << " bit " << name << " " << std::fixed << std::setprecision(1)
        << min_time * 1000000000 << " ns" << std::endl;
}

int main() {
    #define TEST(T) \
        Test<T>("Builtin", PopCntBuiltin<T>); \
        Test<T>("Parallel", PopCntParallel<T>); \
        Test<T>("Kernighan", PopCntKernighan<T>); \
        std::cout << std::endl;
    
    TEST(uint8_t); TEST(uint16_t); TEST(uint32_t); TEST(uint64_t);
    #if HAS128
        TEST(uint128_t);
    #endif
    
    #undef TEST
}
Arty
  • 14,883
  • 6
  • 36
  • 69
0
#!/usr/local/bin/perl


$c = 0x11BBBBAB;
$count = 0;
$m = 0x00000001;
for ($i = 0; $i < 32; $i++)
{
    $f = $c & $m;
    if ($f == 1)
    {
        $count++;
    }
    $c = $c >> 1;
}
printf("%d", $count);

I've done it through a Perl script. The number taken is $c = 0x11BBBBAB.
Each B = 3 1s
Each A = 2 1s
so in total
1+1+3+3+3+2+3+3 = 19
  • 4
    Is there something special about this implementation? The accepted answer is obviously much more efficient than your answer, so how is this a "best" solution (as requested in the question)? – Simon MᶜKenzie Jun 07 '12 at 06:50
0

A simple way which should work nicely for a small number of bits is something like this (for 4 bits in this example):

(i & 1) + (i & 2)/2 + (i & 4)/4 + (i & 8)/8

Would others recommend this for a small number of bits as a simple solution?

Matthew Mitchell
  • 5,293
  • 14
  • 70
  • 122
0

Convert the integer to a binary string and count the ones.

PHP solution:

substr_count(decbin($integer), '1');
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
0

I have not seen this approach anywhere:

int nbits(unsigned char v) {
    return ((((v - ((v >> 1) & 0x55)) * 0x1010101) & 0x30c00c03) * 0x10040041) >> 0x1c;
}

It works per byte, so it would have to be called four times for a 32-bit integer. It is derived from the sideways addition, but it uses two 32-bit multiplications to reduce the number of instructions to only seven.

Most current C compilers will optimize this function using SIMD (SSE2) instructions when it is clear that the number of requests is a multiple of 4, and it becomes quite competitive. It is portable, can be defined as a macro or inline function and does not need data tables.

This approach can be extended to work on 16 bits at a time, using 64-bit multiplications. However, it fails when all 16 bits are set, returning zero, so it can be used only when the 0xFFFF input value is not present. It is also slower due to the 64-bit operations and does not optimize well.
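
For instance, the four calls for a 32-bit integer could be wrapped up like this (bits32 is my illustrative name, not part of the original):

int bits32(unsigned int v) {
    /* nbits() takes an unsigned char, so each call sees one byte */
    return nbits(v) + nbits(v >> 8) + nbits(v >> 16) + nbits(v >> 24);
}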

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
cipilo
  • 81
  • 3
0

A simple algorithm to count the number of set bits:

int countbits(unsigned int n) {
    int count = 0;
    while(n != 0) {
        n = n & (n-1);
        count++;
    }
    return count;
}

Take the example of 11 (1011) and try manually running through the algorithm. It should help you a lot!
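
Working it through for 11:

n = 1011 (11), n-1 = 1010, n & (n-1) = 1010, count = 1
n = 1010 (10), n-1 = 1001, n & (n-1) = 1000, count = 2
n = 1000 (8),  n-1 = 0111, n & (n-1) = 0000, count = 3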

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Arjun Singh
  • 7
  • 1
  • 6
0
def hammingWeight(n):
    count = 0
    while n:
        if n&1:
            count += 1
        n >>= 1
    return count
Shyambeer Singh
  • 318
  • 2
  • 9
0

Here is the functional master race recursive solution, and it is by far the purest one (and can be used with any bit length!):

template<typename T>
int popcnt(T n)
{
  if (n > 0)
    return (n & 1) + popcnt(n >> 1); // parentheses needed: + binds tighter than &
  return 0;
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
monoceres
  • 4,722
  • 4
  • 38
  • 63
0

For Java, there is a java.util.BitSet. https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html

cardinality(): Returns the number of bits set to true in this BitSet.

The BitSet is memory efficient, since it's stored internally as an array of longs.

Steven Chou
  • 1,504
  • 2
  • 21
  • 43
0

Kotlin pre 1.4

        fun NumberOfSetBits(i: Int): Int {
            var i = i
            i -= (i ushr 1 and 0x55555555)
            i = (i and 0x33333333) + (i ushr 2 and 0x33333333)
            return (i + (i ushr 4) and 0x0F0F0F0F) * 0x01010101 ushr 24
        }

This is more or less a copy of the answer seen in the top answer.

It has the Java fixes applied and was then converted using the converter in IntelliJ IDEA Community Edition.

1.4 and beyond (as of 2021-05-05 - it could change in the future).

        fun NumberOfSetBits(i: Int): Int {
            return i.countOneBits()
        }

Under the hood it uses Integer.bitCount as seen here:

@SinceKotlin("1.4")
@WasExperimental(ExperimentalStdlibApi::class)
@kotlin.internal.InlineOnly
public actual inline fun Int.countOneBits(): Int = Integer.bitCount(this)

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Jfm Meyers
  • 123
  • 1
  • 12
0

For those who want it in C++11 for any unsigned integer type as a constexpr function (tacklelib/include/tacklelib/utility/math.hpp):

#include <stdint.h>
#include <limits>
#include <type_traits>

const constexpr uint32_t uint32_max = (std::numeric_limits<uint32_t>::max)();

namespace detail
{
    template <typename T>
    inline constexpr T _count_bits_0(const T & v)
    {
        return v - ((v >> 1) & 0x55555555);
    }

    template <typename T>
    inline constexpr T _count_bits_1(const T & v)
    {
        return (v & 0x33333333) + ((v >> 2) & 0x33333333);
    }

    template <typename T>
    inline constexpr T _count_bits_2(const T & v)
    {
        return (v + (v >> 4)) & 0x0F0F0F0F;
    }

    template <typename T>
    inline constexpr T _count_bits_3(const T & v)
    {
        return v + (v >> 8);
    }

    template <typename T>
    inline constexpr T _count_bits_4(const T & v)
    {
        return v + (v >> 16);
    }

    template <typename T>
    inline constexpr T _count_bits_5(const T & v)
    {
        return v & 0x0000003F;
    }

    template <typename T, bool greater_than_uint32>
    struct _impl
    {
        static inline constexpr T _count_bits_with_shift(const T & v)
        {
            return
                detail::_count_bits_5(
                    detail::_count_bits_4(
                        detail::_count_bits_3(
                            detail::_count_bits_2(
                                detail::_count_bits_1(
                                    detail::_count_bits_0(v)))))) + count_bits(v >> 32);
        }
    };

    template <typename T>
    struct _impl<T, false>
    {
        static inline constexpr T _count_bits_with_shift(const T & v)
        {
            return 0;
        }
    };
}

template <typename T>
inline constexpr T count_bits(const T & v)
{
    static_assert(std::is_integral<T>::value, "type T must be an integer");
    static_assert(!std::is_signed<T>::value, "type T must be not signed");

    return uint32_max >= v ?
        detail::_count_bits_5(
            detail::_count_bits_4(
                detail::_count_bits_3(
                    detail::_count_bits_2(
                        detail::_count_bits_1(
                            detail::_count_bits_0(v)))))) :
        detail::_impl<T, sizeof(uint32_t) < sizeof(v)>::_count_bits_with_shift(v);
}

Plus tests in google test library:

#include <stdlib.h>
#include <time.h>

namespace {
    template <typename T>
    inline uint32_t _test_count_bits(const T & v)
    {
        uint32_t count = 0;
        T n = v;
        while (n > 0) {
            if (n % 2) {
                count += 1;
            }
            n /= 2;
        }
        return count;
    }
}

TEST(FunctionsTest, random_count_bits_uint32_100K)
{
    srand(uint_t(time(NULL)));
    for (uint32_t i = 0; i < 100000; i++) {
        const uint32_t r = uint32_t(rand()) + (uint32_t(rand()) << 16);
        ASSERT_EQ(_test_count_bits(r), count_bits(r));
    }
}

TEST(FunctionsTest, random_count_bits_uint64_100K)
{
    srand(uint_t(time(NULL)));
    for (uint32_t i = 0; i < 100000; i++) {
        const uint64_t r = uint64_t(rand()) + (uint64_t(rand()) << 16) + (uint64_t(rand()) << 32) + (uint64_t(rand()) << 48);
        ASSERT_EQ(_test_count_bits(r), count_bits(r));
    }
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Andry
  • 2,273
  • 29
  • 28
0

Counting set bits in binary representation (N):

Pseudocode -

  1. set counter = 0.

  2. repeat counting till N is not zero.

  3. check last bit. if last bit = 1 , increment counter

  4. Discard last digit of N.

Now let's code this in C++

int countSetBits(unsigned int n){
    int count = 0;
    while(n != 0){
        count += n & 1;
        n = n >> 1;
    }
    return count;
}

Let's use this function.

#include <iostream>
using namespace std;

int main(){
    int x = 5;
    cout << countSetBits(x);
    return 0;
}

Output: 2

Because 5 has 2 bits set in binary representation (101).

You can run the code here.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Mayukh Pankaj
  • 151
  • 2
  • 5
0

I'll contribute to @Arty's answer

__popcnt16()/__popcnt()/__popcnt64() MSVC's intrinsic (doc here)

The popcnt instruction, as noted in the "Remarks" section, is available as part of the SSE4 instruction set, and there is a relatively high chance of it not being available.

If you run code that uses these intrinsics on hardware that doesn't support the popcnt instruction, the results are unpredictable.

So, you need to implement a check as per "Remarks" section:

To determine hardware support for the popcnt instruction, call the __cpuid intrinsic with InfoType=0x00000001 and check bit 23 of CPUInfo[2] (ECX). This bit is 1 if the instruction is supported, and 0 otherwise.

Here's how you do it:

unsigned popcnt(const unsigned input)
{
    struct cpuinfo_t
    {
        union
        {
            int regs[4];

            struct
            {
                long eax, ebx, ecx, edx;
            };
        };

        cpuinfo_t() noexcept : regs() {}
    }
    cpuinfo;

    // EAX=1: Processor Info and Feature Bits
    __cpuid(cpuinfo.regs, 1);

    // ECX bit 23: popcnt
    if (_bittest(&cpuinfo.ecx, 23))
    {
        return __popcnt(input);
    }

    // Choose any fallback implementation you like, there's already a ton of them
    unsigned num = input;
    num = (num & 0x55555555) + (num >> 1 & 0x55555555);
    num = (num & 0x33333333) + (num >> 2 & 0x33333333);
    num = (num & 0x0F0F0F0F) + (num >> 4 & 0x0F0F0F0F);
    num = (num & 0x00FF00FF) + (num >> 8 & 0x00FF00FF);
    num = (num & 0x0000FFFF) + (num >> 16 & 0x0000FFFF);
    return num;
}
limitedeternity
  • 143
  • 1
  • 6
0

Here is the sample code, which might be useful.

private static final int[] bitCountArr = new int[]{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
private static final int firstByteFF = 255;
public static final int getCountOfSetBits(int value){
    int count = 0;
    for(int i=0;i<4;i++){
        if(value == 0) break;
        count += bitCountArr[value & firstByteFF];
        value >>>= 8;
    }
    return count;
}
phant0m
  • 16,595
  • 5
  • 50
  • 82
0

Here's something that works in PHP (all PHP integers are 32-bit signed, thus 31 bits):

function bits_population($nInteger)
{

    $nPop=0;
    while($nInteger)
    {
        $nInteger^=(1<<(floor(1+log($nInteger)/log(2))-1));
        $nPop++;
    }
    return $nPop;
}
oxygen
  • 5,891
  • 6
  • 37
  • 69
-1
// How about the following:
public int CountBits(int value)
{
    int count = 0;
    while (value > 0)
    {
        if ((value & 1) == 1)
            count++;
        value >>= 1; // shift right (the original shifted left, which drops the bits before they are counted)
    }
    return count;
}
Jim McCurdy
  • 2,052
  • 14
  • 14
-1

You can do something like:

int countSetBits(int n)
{
    n=((n&0xAAAAAAAA)>>1) + (n&0x55555555);
    n=((n&0xCCCCCCCC)>>2) + (n&0x33333333);
    n=((n&0xF0F0F0F0)>>4) + (n&0x0F0F0F0F);
    n=((n&0xFF00FF00)>>8) + (n&0x00FF00FF);
    n=((n&0xFFFF0000)>>16) + (n&0x0000FFFF); // combine the final two 16-bit counts
    return n;
}

int main()
{
    int n=10;
    printf("Number of set bits: %d",countSetBits(n));
     return 0;
}

See here: http://ideone.com/JhwcX

The working can be explained as follows:

First, all the even bits are shifted right and added to the odd bits to count the number of bits in groups of two. Then we work in groups of two, then four, and so on.

Green goblin
  • 9,898
  • 13
  • 71
  • 100
  • Buggy: https://godbolt.org/g/YuN5cA returns 131083 for `n=1234667`. You need to combine the two 8-bit chunks that the last step leaves, and clear the high bits. Also, it's apparently possible to be more efficient than this SWAR 1-bit, 2-bit, 4-bit sequence. I haven't grokked the magic of the top answer's bit-twiddling hack, though. https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer/109025#109025 – Peter Cordes Sep 22 '17 at 06:50
-1

I am giving two algorithms to answer the question,

package countSetBitsInAnInteger;

import java.util.Scanner;

public class UsingLoop {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        try {
            System.out.println("Enter a integer number to check for set bits in it");
            int n = in.nextInt();
            System.out.println("Using while loop, we get the number of set bits as: " + usingLoop(n));
            System.out.println("Using Brain Kernighan's Algorithm, we get the number of set bits as: " + usingBrainKernighan(n));
            System.out.println("Using ");
        }
        finally {
            in.close();
        }
    }

    private static int usingBrianKernighan(int n) {
        int count = 0;
        while(n > 0) {
            n &= (n-1);
            count++;
        }
        return count;
    }

    /*
        Analysis:
            Time complexity = O(lgn)
            Space complexity = O(1)
    */

    private static int usingLoop(int n) {
        int count = 0;
        for(int i=0; i<32; i++) {
            if((n&(1 << i)) != 0)
                count++;
        }
        return count;
    }

    /*
        Analysis:
            Time Complexity = O(32) // Maybe the complexity is O(lgn)
            Space Complexity = O(1)
    */
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Nikhil Katre
  • 2,114
  • 23
  • 22
-1

For JavaScript, you can count the number of set bits on a 32-bit value using a lookup table (and this code can be easily translated to C). In addition, 8-bit and 16-bit versions are added for completeness, for people who find this through a web search.

const COUNT_BITS_TABLE = makeLookupTable()

function makeLookupTable() {
  const table = new Uint8Array(256)
  for (let i = 0; i < 256; i++) {
    table[i] = (i & 1) + table[(i / 2) | 0];
  }
  return table
}

function countOneBits32(n) {
  return COUNT_BITS_TABLE[n & 0xff] +
    COUNT_BITS_TABLE[(n >> 8) & 0xff] +
    COUNT_BITS_TABLE[(n >> 16) & 0xff] +
    COUNT_BITS_TABLE[(n >> 24) & 0xff];
}

function countOneBits16(n) {
  return COUNT_BITS_TABLE[n & 0xff] +
    COUNT_BITS_TABLE[(n >> 8) & 0xff]
}

function countOneBits8(n) {
  return COUNT_BITS_TABLE[n & 0xff]
}

console.log('countOneBits32', countOneBits32(0b10101010000000001010101000000000))
console.log('countOneBits32', countOneBits32(0b10101011110000001010101000000000))
console.log('countOneBits16', countOneBits16(0b1010101000000000))
console.log('countOneBits8', countOneBits8(0b10000010))
Lance
  • 75,200
  • 93
  • 289
  • 503
-2

I use the following function. I haven't checked benchmarks, but it works.

int msb(int num)
{
    int m = 0;
    for (int i = 16; i > 0; i = i>>1)
    {
        // debug(i, num, m);
        if(num>>i)
        {
            m += i;
            num>>=i;
        }
    }
    return m;
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
prongs
  • 9,422
  • 21
  • 67
  • 105
-2

This is the implementation in Go:

func CountBitSet(n int) int {
    count := 0
    for n > 0 {
        count += n & 1
        n >>= 1
    }
    return count
}
Bamidele Alegbe
  • 524
  • 4
  • 7
-3
public class BinaryCounter {

private int N;

public BinaryCounter(int N) {
    this.N = N;
}

public static void main(String[] args) {

    BinaryCounter counter=new BinaryCounter(7);     
    System.out.println("Number of ones is "+ counter.count());

}

public int count(){
    if(N<=0) return 0;
    int counter=0;
    int K = 0;
    do{
        K = biggestPowerOfTwoSmallerThan(N);
        N = N-K;
        counter++;
    }while (N != 0);
    return counter;

}

private int biggestPowerOfTwoSmallerThan(int N) {
    if(N==1) return 1;
    for(int i=0;i<N;i++){
        if(Math.pow(2, i) > N){
            int power = i-1;
            return (int) Math.pow(2, power);
        }
    }
    return 0;
}
}
John Smith
  • 7,243
  • 6
  • 49
  • 61
Burhan ARAS
  • 2,517
  • 25
  • 19
-3

This will also work fine:

int ans = 0;
while(num) {
  ans += (num & 1);
  num = num >> 1;
}    
return ans;
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
decrypto
  • 1
  • 1
  • 1
    How is this different from [this existing answer](https://stackoverflow.com/a/24099386/1048572)? – Bergi Feb 11 '18 at 13:17
  • 2
    Both without a mention of [S.E. Anderson's Bit Twiddling Hacks](http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan) or B. Kernighan. Sigh. – greybeard Feb 11 '18 at 19:10
  • 1
    This only counts the number of initial set bits. – S.S. Anne Sep 19 '19 at 17:31