I was asked the above question in an interview, and the interviewer was very certain of the answer. But I am not sure. Can anyone help me here?
3 Answers
Sure. The obvious brute-force method is just a big lookup table with one entry for every possible value of the input number. That's not very practical if the number is very big, but it's still enough to prove it's possible.
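A sketch of that brute-force approach for 8-bit inputs (the function name and the lazy table initialization are my own illustration):

```c
#include <stdint.h>

/* Brute-force popcount via a lookup table with one entry per possible
 * 8-bit input.  For a w-bit input the table needs 2^w entries, which is
 * why this only demonstrates that O(1) is *possible*, not that it's
 * practical for large w. */
int popcount8(uint8_t x) {
    static uint8_t table[256];
    static int built = 0;
    if (!built) {                      /* precompute the table once */
        for (int i = 0; i < 256; i++) {
            int count = 0;
            for (int b = 0; b < 8; b++)
                count += (i >> b) & 1;
            table[i] = (uint8_t)count;
        }
        built = 1;
    }
    return table[x];                   /* the actual "work": one lookup */
}
```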
Edit: the notion has been raised that this is complete nonsense, and the same could be said of essentially any algorithm.
To a limited degree, that's a fair statement -- but the limitations are so severe that for most algorithms it remains utterly meaningless.
My original point (at least as well as I remember it) was that population counting is about equivalent to many other operations like addition and subtraction that we normally assume are O(1).
At the hardware level, circuitry for a single-cycle `POPCNT` instruction is probably easier than for a single-cycle `ADD` instruction. Just for one example, for any practical size of data word, we can use table lookups on 4-bit chunks in parallel, then add the results from those pieces together. Even using fairly unlikely worst-case assumptions (e.g., separate storage for each of those tables), this would still be easy to implement in a modern CPU -- in fact, it's probably at least somewhat simpler than the single-cycle addition or subtraction mentioned above¹.
This is a decided contrast to many other algorithms. For one obvious example, let's consider sorting. For even the most trivial sort most people can imagine -- 2 items, 8 bits apiece -- we already need a 65,536-entry lookup table to get constant complexity. Long before we can do even a fairly trivial sort (e.g., 100 items) we need a lookup table that contains far more entries than there are atoms in the universe.
Looking at it from the opposite direction, it's certainly true that at some point, essentially nothing is O(1) any more. Let's consider the most trivial operations possible. For an N-bit CPU, bitwise `OR` is normally implemented as a set of N `OR` gates in parallel. Unlike addition, there's no interaction between one bit and another, so for any practical size of CPU, this is easy to execute in a single instruction.
Nonetheless, if I specify a bit-wise `OR` in which each operand is 100 petabits, there's nothing even approaching a practical way to do the job with constant complexity. Using the usual method of parallel `OR` gates, we end up with (among other things) 300 petabits' worth of input and output lines -- a number that completely dwarfs even the number of pins on the largest CPUs.
On reasonable hardware, doing a bitwise `OR` on 100-petabit operands is going to take a while (not to mention quite a bit of hard drive space). If we increase that to 200-petabit operands, the time is likely to (about) double -- so from that viewpoint, it's an O(N) operation. Obviously enough, the same is going to be true of the other "trivial" operations like addition, subtraction, bit-wise `AND`, bit-wise `XOR`, and so on.
Nonetheless, unless you have very specific instructions to say you're going to be dealing with utterly immense operands, you're typically going to treat every one of these as a constant-complexity operation. Looked at in these terms, a `POPCNT` instruction falls about halfway between bit-wise `AND`/`OR`/`XOR` on one hand and addition/subtraction on the other, in terms of the difficulty of executing in fixed time.
1. You might wonder how it could possibly be simpler than an `add` when it actually includes an `add` after doing some other operations. If so, kudos -- it's an excellent question.
The answer is that it's because it only needs a much smaller adder. For example, a 64-bit CPU needs one half-adder and 63 full-adders. In the simple implementation, you carry out the addition bit-wise -- i.e., you add bit-0 of one operand to bit-0 of the other. That generates an output bit, and a carry bit. That carry bit becomes an input to the addition for the next pair of bits. There are some tricks to parallelize that to some degree, but the nature of the beast (so to speak) is bit-serial.
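To make the bit-serial structure concrete, here's a software model of a ripple-carry adder (purely illustrative; real hardware implements the gate-level equivalent, and often uses the faster techniques mentioned in the comments below):

```c
#include <stdint.h>

/* Model of ripple-carry addition: each bit position is a full adder
 * whose carry-out feeds the next position, so the carry "ripples"
 * through all 64 stages -- the bit-serial behavior described above. */
uint64_t ripple_add(uint64_t a, uint64_t b) {
    uint64_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < 64; i++) {
        unsigned ai = (unsigned)(a >> i) & 1;
        unsigned bi = (unsigned)(b >> i) & 1;
        unsigned s  = ai ^ bi ^ carry;                 /* sum bit i */
        carry = (ai & bi) | (carry & (ai ^ bi));       /* carry into bit i+1 */
        sum |= (uint64_t)s << i;
    }
    return sum;
}
```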
With a POPCNT instruction, we have an addition after doing the individual table lookups, but our result is limited to the size of the input words. Given the same size of inputs (64 bits) our final result can't be any larger than 64. That means we only need a 6-bit adder instead of a 64-bit adder.
Since, as outlined above, addition is basically bit-serial, this means that the addition at the end of the `POPCNT` instruction is fundamentally a lot faster than a normal add. To be specific, it's logarithmic on the operand size, whereas simple addition is roughly linear on the operand size.

- this is pure BS^H^H nonsense. you could present this argument to any algorithmic question. – Karoly Horvath Sep 17 '13 at 11:20
- @KarolyHorvath: See edited answer. I've gone into considerably more detail about why I think it's reasonable here, but not for many (most?) other algorithms. – Jerry Coffin Sep 17 '13 at 14:46
- If popcount were as cheap to implement as `add`, I think we'd see more CPUs with single-cycle latency for it. But maybe it's a tradeoff between latency and area (number of transistors), and with popcount not being nearly as common as `add`, most architects didn't spend as many transistors on it. Per https://agner.org/optimize/ and https://uops.info/, Intel runs `popcnt` with 3-cycle latency, 1/clock throughput. (On the same execution unit that does bsf/bsr/tzcnt/lzcnt: [How is POPCNT implemented in hardware?](https://stackoverflow.com/q/28802692) quotes a patent on the compressor-tree implementation technique.) – Peter Cordes Dec 05 '22 at 14:53
- AMD CPUs like K10 had 2-cycle latency; Bulldozer (aiming at higher clock speeds) had 4-cycle latency, with 2c throughput for 16/32-bit operand-size and 4c throughput for 64-bit operand-size. So not even fully pipelined. But as of Zen 1, it has the same performance as `add`: 1c latency, 0.25c throughput, so there's a single-cycle popcount unit on each integer execution port. (`lzcnt` performance changed to match, hinting that AMD runs them on the same execution units.) – Peter Cordes Dec 05 '22 at 14:58
- *addition is basically bit-serial* -- I think that's where you're going wrong in estimating the difficulty of `add`. You're thinking of a simple ripple-carry adder, but single-cycle latency for 64-bit `add` generally requires [carry-lookahead](https://en.wikipedia.org/wiki/Carry-lookahead_adder) or [carry-select](https://en.wikipedia.org/wiki/Carry-select_adder) or other advanced techniques with something like O(n * sqrt(n)) gates, giving a gate-delay latency of O(sqrt(n)). – Peter Cordes Dec 05 '22 at 15:05
- @PeterCordes: I think your fundamental thesis is wrong, and you've even hinted at the reason: although a few specific applications benefit from having a popcnt that's faster than a bit-by-bit count in software, in practice use of popcnt is so rare that nobody bothers to optimize it much. – Jerry Coffin Dec 05 '22 at 16:43
- @JerryCoffin: Clearly from AMD Zen's example, we can see that it's possible with single-cycle latency at high clock speeds. And yeah, the fact that other designs don't do that might be a matter of die-area tradeoff, or possibly even lack of design time/effort. As for being *simpler* than single-cycle `add`, I'm not fully convinced; extra gates to shortcut carry propagation aren't that expensive. Popcount is still a tree of sums, like reducing the partial products in a multiplier (except much narrower), so possibly something like a https://en.wikipedia.org/wiki/Dadda_multiplier – Peter Cordes Dec 05 '22 at 16:55
- As for popcount not being important enough to optimize in hardware, there is an AVX512 VPOPCNT extension, with Intel and AMD giving it the same performance characteristics as their scalar versions: 3c latency for Intel, 1c for AMD, with Intel running it on only one port and AMD on two. IDK if that really tells us anything, except that maybe Intel CPU architects copy/pasted some logic blocks instead of re-examining how they do popcnt. – Peter Cordes Dec 05 '22 at 16:59
- This piqued my interest enough that I hacked a popcount instruction into the [DarkRiscV](https://github.com/darklife/darkriscv) and synthesized it for a Xilinx FPGA. The opcode I chose used up the only missing item in a case statement in the decoder. As a result, although it adds a little chip area (distributed RAM to hold 16x3-bit ROMs), the maximum estimated speed actually went *up* a little (though you could also achieve that by adding a `default` to the case statement). The clock probably wasn't previously limited by addition, but adding popcnt didn't slow it down at all either. – Jerry Coffin Dec 06 '22 at 01:08
- The FPGA implementation could be affecting things a bit, and the design I picked was just something simple I could grab and hack on easily -- in a different design that started out more optimized, results might be different. Nonetheless, I think it's a decent indication that there's no real reason to expect this implementation of popcnt to slow the clock or noticeably increase chip area. – Jerry Coffin Dec 06 '22 at 01:12
If the bit size is fixed (e.g. natural word size of a 32- or 64-bit machine), you can just iterate over the bits and count them directly in O(1) time (though there are certainly faster ways to do it). For arbitrary precision numbers (BigInt, etc.), the answer must be no.
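That fixed-width iteration might look like this (a minimal sketch):

```c
#include <stdint.h>

/* Count set bits by examining them one at a time.  For a fixed 64-bit
 * word this runs at most 64 iterations regardless of the value, so it's
 * O(1) in the sense used above: bounded by the (constant) word width. */
int popcount_loop(uint64_t x) {
    int count = 0;
    while (x) {
        count += (int)(x & 1);
        x >>= 1;
    }
    return count;
}
```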

Some processors can do it in one instruction, obviously for integers of limited size. Look up the POPCNT mnemonic for further details.
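For example, GCC and Clang expose this through `__builtin_popcountll`, which compiles to a `POPCNT` instruction when the target supports it (e.g., with `-mpopcnt`) and to a software fallback otherwise:

```c
#include <stdint.h>

/* Thin wrapper for illustration: on a POPCNT-capable target this
 * compiles down to a single instruction. */
int popcount_builtin(uint64_t x) {
    return __builtin_popcountll((unsigned long long)x);
}
```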
For integers of unlimited size obviously you need to read the whole input, so the lower bound is O(n).
The interviewer probably meant the bit counting trick (the first Google result follows): http://www.gamedev.net/topic/547102-bit-counting-trick---new-to-me/
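One common form of that trick is the standard SWAR popcount, shown here with 64-bit constants (this is the widely known version, not necessarily the exact code at the link):

```c
#include <stdint.h>

/* SWAR popcount: add adjacent bits in progressively wider fields
 * (2-bit, then 4-bit, then 8-bit sums), then use a multiply to add the
 * eight per-byte counts into the top byte. */
int popcount_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);              /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);                  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;              /* 8-bit sums */
    return (int)((x * 0x0101010101010101ULL) >> 56);         /* grand total */
}
```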
