
It's a well-known fact that multiplication, integer division, and modulo by powers of two can be rewritten more efficiently as bitwise operations:

>>> from random import randint
>>> x = randint(50000, 100000)
>>> x << 2 == x * 4
True
>>> x >> 2 == x // 4
True
>>> x & 3 == x % 4
True

In compiled languages such as C, C++, and Java, tests have shown that bitwise operations are generally faster than the equivalent arithmetic operations (see here and here). However, when I test these in Python, I get the opposite result:

In [1]: from random import randint
   ...: nums = [randint(0, 1000000) for _ in range(100000)]

In [2]: %timeit [i * 8 for i in nums]
7.73 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit [i << 3 for i in nums]
8.22 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit [i // 8 for i in nums]
7.05 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit [i >> 3 for i in nums]
7.55 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit [i % 8 for i in nums]
5.96 ms ± 503 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit [i & 7 for i in nums]
8.29 ms ± 816 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, the bitwise operations are slower than their arithmetic counterparts, especially for modulo. I repeated the test with another set of numbers and got the same result. Is there a reason for this? These tests were run on CPython 3.6.7, if that matters.
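
Disassembling one of these expressions with the standard `dis` module shows that CPython emits a generic BINARY_MODULO instruction rather than strength-reducing the constant modulo to a mask (output from CPython 3.6; byte offsets and opcode names differ in newer versions):

>>> import dis
>>> dis.dis(lambda i: i % 8)
  1           0 LOAD_FAST                0 (i)
              2 LOAD_CONST               1 (8)
              4 BINARY_MODULO
              6 RETURN_VALUE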

iz_
  • Well, for starters, Python `int` objects are nothing like a C `int`; they are *objects*. The underlying representation isn't even a C `int`; it's an array of digits. If you understand C, take a look for yourself: https://github.com/python/cpython/blob/master/Include/longintrepr.h This is because Python `int` objects are *arbitrarily sized*, not fixed-size. – juanpa.arrivillaga Jan 04 '19 at 22:45
  • divide and modulo by a power of 2 isn't a great test; any decent C compiler would compile that to a shift. (Or for signed `i`, shifts + stuff to get the rounding semantics right for possibly-negative `i`). I don't expect that CPython will, but if you want to compare it to C, use a runtime-variable divisor. (Otherwise `i / 12345` can still compile to a fast multiply / shift. [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935). Modern x86 CPUs have very high performance multiply that's about as fast as 3 shifts.) – Peter Cordes Jan 05 '19 at 00:24
  • @PeterCordes I am aware compilers make such optimizations, so I used the `dis` module to check that `i % 8` is not doing a bit shift, and as far as I could tell, it wasn't. However, maybe the optimization happened further down, where the `dis` module could not reveal it... I'm not sure. – iz_ Jan 05 '19 at 03:30
  • I don't know much about Python internals, but such an optimization is only likely / really good at a point where the optimization can be done once for a constant, not every time through the loop. Checking `x & (x-1) == 0` to test for a power of 2, then using a bit-scan like POSIX `ffs()` or x86 `bsf` / `tzcnt` to get a shift count is probably not worth it unless the implementation expects powers of 2 to be very common. (And then only for divide if at all, not multiply, unless the input has multiple limbs). – Peter Cordes Jan 05 '19 at 03:41

2 Answers


`*`, `%`, and `//` all have fast paths for single-"limb" integers. `<<`, `>>`, and `&` don't; they go through the general-purpose arbitrary-precision code path.
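
One way to see the effect (a quick sketch; `big_nums` here mirrors the question's `nums`, shifted up to around 2**300 so every value spans many limbs): if the single-limb fast paths are what's responsible, the bitwise forms should stop losing once all operands take the arbitrary-precision path.

from timeit import timeit

# Values near 2**300 occupy ~10 of CPython's 30-bit limbs each, so no
# operation below qualifies for the single-limb fast path.
big_nums = [n + (1 << 300) for n in range(100000)]

print(timeit('[i * 8 for i in big_nums]', globals=globals(), number=10))
print(timeit('[i << 3 for i in big_nums]', globals=globals(), number=10))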

user2357112
  • So this "fast path" you speak of is occurring in the interpretation of the code, correct? – iz_ Jan 04 '19 at 22:54
  • It's in the operator implementations. – user2357112 Jan 04 '19 at 22:57
  • Thanks, the source code to the implementations was helpful. One more thing, could you elaborate on what "single-limb" integers are? – iz_ Jan 04 '19 at 23:00
  • Limbs are digits, but bigger. It doesn't look like the CPython source uses the term; I got the term from reading about GMP. The CPython source just uses "digit". – user2357112 Jan 04 '19 at 23:01
  • So, to clarify, for extremely large numbers, will the bitwise operators be faster? – iz_ Jan 04 '19 at 23:04
  • @Tomothy32: a "limb" is a base-2^30 "digit", stored as a binary integer in a `uint32_t` or equivalent. CPython stores big integers in chunks of that size. Just like how you can add decimal numbers on paper one digit at a time, using base-2^30 digits allows CPython to get lots of work done with each primitive add operation the CPU does. (Using base-2^32 chunks would allow using add / add-with-carry in asm to take advantage of hardware support for big integers on most architectures, but that's not what CPython does. 2^30 has some advantages, though, esp. for pure C without asm.) See the snippet after this thread for a way to check the digit size on your build. – Peter Cordes Jan 05 '19 at 00:14
  • @PeterCordes Thanks for the explanation, I couldn't find anything well-documented about limbs online! – iz_ Jan 05 '19 at 03:32
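
The digit size discussed in this thread can be inspected at runtime via `sys.int_info`; the values below are from a typical 64-bit CPython build (builds configured for 15-bit digits will report differently):

>>> import sys
>>> sys.int_info
sys.int_info(bits_per_digit=30, sizeof_digit=4)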

I tested with large numbers, and the bitwise operator is faster:

python -m timeit '[i for i in range(10**64, 10**64+1000) if i & 0b1 == 0]'
1000 loops, best of 3: 238 usec per loop

python -m timeit '[i for i in range(10**64, 10**64+1000) if i % 2 == 0]'
1000 loops, best of 3: 303 usec per loop
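
This is consistent with the fast-path explanation in the answer above: `10**64` is a 213-bit value, spanning eight of CPython's 30-bit digits, so both expressions take the arbitrary-precision path, where a mask is far less work than a full division:

>>> (10**64).bit_length()
213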
JackyChou