
I was surprised by the large difference in performance between a Python implementation of Brian Kernighan's algorithm for counting one bits and the built-in approach of counting ones in the number's string representation.

Converting to a string and then counting ones seemed to me like a bad idea.

Now what seems like a bad idea is looping instead of using built-in functions when performance matters.

import random

x = random.randint(0, 1 << 1000000)

def count_ones(number):
    c = 0
    while number != 0:
        number &= number - 1
        c += 1
    return c


%timeit bin(x).count("1")
%timeit count_ones(x)


5.09 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
25 s ± 544 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Stan
  • Counting ones where? in an arbitrary number's binary representation? – DeepSpace Sep 23 '18 at 13:24
  • Algorithm timings in Python can sometimes be counter-intuitive, since the built-in functions do their stuff at C speed. See https://stackoverflow.com/questions/9829578/fast-way-of-counting-non-zero-bits-in-positive-integer I have comprehensive speed tests of bit counting [here](https://gist.github.com/PM2Ring/1addf93cc3976b699d31ff1bad45e4ed) – PM 2Ring Sep 23 '18 at 13:28
  • 1
    I might be tempted to do this along the lines of: `sum(x & (1 << n) > 0 for n in range(x.bit_length()))`.... not sure how performant that is... – Jon Clements Sep 23 '18 at 13:29
  • 1
    In CPython builtin functions are in machine code, while running Python requires running the whole interpreter, so it is very often faster to use builtins rather than implementing custom algorithms, even if those would be faster when compiled to machine code. They would probably be faster when run on PyPy. – zch Sep 23 '18 at 13:30
  • 1
    If you have Python 3.6+, `def bincount_f(n): return f'{n:b}'.count("1")` is slightly faster than `bincount_fmt`, but still slower than `bin(n).count("1")`. – PM 2Ring Sep 23 '18 at 13:33
  • 1
    Or maybe `sum(1 for n in range(x.bit_length()) if x & (1 << n))` is slightly more sensible... – Jon Clements Sep 23 '18 at 13:35
  • 2
    But that's not all. Your algorithm has actually quadratic time with respect to size of representation of the number, so it would still not be efficient for bigints. – zch Sep 23 '18 at 13:35
  • @JonClements Your 1st suggestion is positively glacial. :) About 20 times slower than `bin(n).count("1")` on 32 bit ints. The new one is better, but still around 10 times slower. – PM 2Ring Sep 23 '18 at 13:40
  • @JonClements `%timeit sum(x & (1 << n) > 0 for n in range(x.bit_length()))`: 14.7 s ± 283 ms per loop (mean ± std. dev. of 7 runs, 1 loop each); `%timeit sum(1 for n in range(x.bit_length()) if x & (1 << n))`: 14.4 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) – Stan Sep 23 '18 at 13:42
  • @PM2Ring that's actually faster than I thought...:) One can always use the builtin `bin` code as a base and adapt it to return an integer of 1's rather than a string I guess though :) – Jon Clements Sep 23 '18 at 13:43
  • @Stan yeah... not great... almost twice as fast as the existing bit fiddling though :) – Jon Clements Sep 23 '18 at 13:44
  • 1
    @JonClements Or just use `gmpy.popcount`, which is way faster. – PM 2Ring Sep 23 '18 at 13:49
  • @zch I see what you mean by quadratic time, didn't pay attention to this fact. Bitwise operations in registers are fast but they won't longer be so for whatever internal representation python has for big numbers. – Stan Sep 23 '18 at 15:22
  • @PM2Ring your solution `f'{x:b}'.count("1")` is slightly faster than `bin(x).count("1")` on my machine (5.1 ms vs 5.88 ms) – Stan Sep 23 '18 at 15:47
  • Interesting! I currently use Python 3.6.0 on a very old 32 bit machine with a 2GHz single-core CPU, running Linux. FWIW, when f-strings were first introduced they were slower than `format` / `.format`, but that got fixed partly due to the efforts of an SO Python regular. – PM 2Ring Sep 23 '18 at 16:40

1 Answer


Kernighan's algorithm works best on data which fits in the ALU, often 64-bits on modern hardware. For longer numbers, the algorithm becomes quadratic in the length of the number because each iteration does a computation over the entire length of the number. The arithmetic can be hand-optimised because it's clear that once the borrow stops propagating, nothing will change as a result of the bitwise and.
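For word-sized integers the algorithm is as good as its reputation. A quick sanity check (a sketch, not part of the original answer) shows the loop agrees with the string-counting approach on 64-bit values, where each iteration really is a couple of cheap machine operations:

```python
import random

def count_ones(number):
    """Kernighan's algorithm: n & (n - 1) clears the lowest set bit."""
    c = 0
    while number != 0:
        number &= number - 1  # only the lowest set bit changes
        c += 1
    return c

# On 64-bit values the loop runs at most 64 times,
# and both approaches agree on the result.
for _ in range(1000):
    n = random.getrandbits(64)
    assert count_ones(n) == bin(n).count("1")
```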

Even with that optimisation, we're still in Shlemiel the Painter territory; the computation is quadratic because the scan effectively always starts at the same place on every iteration, scanning further and further each time.
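The quadratic growth is easy to see empirically. This rough sketch (timing numbers will vary by machine; it is an illustration, not a proper benchmark) times the loop on all-ones inputs of doubling length; doubling the bit count roughly quadruples the runtime:

```python
import time

def count_ones(number):
    c = 0
    while number != 0:
        number &= number - 1
        c += 1
    return c

for bits in (20000, 40000, 80000):
    n = (1 << bits) - 1  # worst case: every bit set
    t0 = time.perf_counter()
    count_ones(n)
    print(f"{bits} bits: {time.perf_counter() - t0:.3f} s")
```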

Anyway, it would be very surprising to see even a sophisticated optimiser find that optimisation and Python's bignum implementation does not have a sophisticated optimiser. It does not, for example, fuse the decrement with the bitwise and.

Bitwise and on bignums could obviously be done in place, so it would be tempting to write:

number &= number - 1

in the hopes that an in-place operation would be performed. But it won't make any difference with CPython; CPython's bignums are immutable so in-place mutation is not possible.
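A small sketch makes the rebinding visible: the augmented assignment produces a brand-new object rather than mutating the old one (`id()` is used here purely for illustration):

```python
x = (1 << 100) - 1          # a bignum, well beyond any small-int cache
old_id = id(x)
x &= x - 1                  # looks in-place, but...
assert id(x) != old_id      # ...CPython allocated a brand-new bignum
assert x == (1 << 100) - 2  # the lowest set bit was cleared
```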

In short, Python is going to create two new million-bit numbers for every iteration. It's not terribly surprising that it takes a while. Rather, it is surprising that it only takes 25 seconds.

Depending on version and platform, CPython bignum operations are performed in 15- or 30-bit units. (Computation on integers which fit into 30 bits is slightly optimised, but 64 bits is well outside of that range, so don't expect limiting numbers to 64 bits to avoid bignum overhead.) Assuming your platform uses 30-bit units, running the algorithm for the expected half-million iterations means performing more than 33 billion single-word computations (a subtract and a bitwise and, each over (1000000/30 ≈ 33334)-word bignums, for each of roughly 500,000 iterations), plus a million 130 KB memory allocations. Doing that in 25 seconds is not shabby at all. It's a testimony to how fast modern CPUs are, and a warning about how easy it is not to notice that you're using a quadratic algorithm until the numbers get really large.
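If you just need the population count, there are linear-time options that avoid the quadratic loop entirely. A sketch (note the hedges: `int.bit_count()` requires Python 3.10+, and `gmpy2` is a third-party package that may not be installed):

```python
import sys

x = (1 << 1_000_000) | 1  # a million-bit number with exactly two set bits

# Linear in the size of the number: one pass to build the string, one to count.
assert bin(x).count("1") == 2

# Python 3.10+ exposes a native popcount on int, avoiding the string round-trip.
if sys.version_info >= (3, 10):
    assert x.bit_count() == 2

# Third-party: gmpy2.popcount(x) is typically faster still, if installed.
```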

rici