
I was looking at the source of sortedcontainers and was surprised to see this line:

self._load, self._twice, self._half = load, load * 2, load >> 1

Here load is an integer. Why use a bit shift in one place and multiplication in another? It seems reasonable that bit shifting may be faster than integer division by 2, but why not replace the multiplication with a shift as well? I benchmarked the following cases:

  1. (times, divide)
  2. (shift, shift)
  3. (times, shift)
  4. (shift, divide)

and found that #3 is consistently faster than other alternatives:

# self._load, self._twice, self._half = load, load * 2, load >> 1

import random
import timeit
import pandas as pd

x = random.randint(10 ** 3, 10 ** 6)

def test_naive():
    a, b, c = x, 2 * x, x // 2

def test_shift():
    a, b, c = x, x << 1, x >> 1    

def test_mixed():
    a, b, c = x, x * 2, x >> 1    

def test_mixed_swapped():
    a, b, c = x, x << 1, x // 2

def observe(k):
    print(k)
    return {
        'naive': timeit.timeit(test_naive),
        'shift': timeit.timeit(test_shift),
        'mixed': timeit.timeit(test_mixed),
        'mixed_swapped': timeit.timeit(test_mixed_swapped),
    }

def get_observations():
    return pd.DataFrame([observe(k) for k in range(100)])
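(For reference, the same comparison can be reproduced without pandas; the statement strings and repetition counts below are my own choices, not part of the original benchmark:)

```python
# A pandas-free summary of the same comparison (statement strings and
# repetition counts are arbitrary choices, not from the original benchmark).
import statistics
import timeit

x = 123456

stmts = {
    'naive': 'a, b, c = x, 2 * x, x // 2',
    'shift': 'a, b, c = x, x << 1, x >> 1',
    'mixed': 'a, b, c = x, x * 2, x >> 1',
    'mixed_swapped': 'a, b, c = x, x << 1, x // 2',
}

results = {
    name: statistics.mean(
        timeit.repeat(stmt, globals={'x': x}, number=100000, repeat=5)
    )
    for name, stmt in stmts.items()
}

# Print the variants from fastest to slowest.
for name, t in sorted(results.items(), key=lambda kv: kv[1]):
    print('{:14s} {:.4f}s'.format(name, t))
```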

(two benchmark plots of the timing distributions for the four variants, with `mixed` consistently fastest)

The question:

Is my test valid? If so, why is (multiply, shift) faster than (shift, shift)?

I run Python 3.5 on Ubuntu 14.04.

Edit

Above is the original statement of the question. Dan Getz provides an excellent explanation in his answer.

For the sake of completeness, here are sample illustrations for larger x when multiplication optimizations do not apply.

(two plots for larger `x`, where the small-int multiplication optimization no longer applies)

smci
hilberts_drinking_problem
    Where did you define `x`? – JBernardo May 05 '16 at 14:42
    I would really like to see if there is any difference using little endian/big endian. Really cool question btw ! – LiGhTx117 May 05 '16 at 14:45
  • JBernardo, it got lost in my interactive session with the reloads. will make an edit. The results seem consistent. – hilberts_drinking_problem May 05 '16 at 14:46
    @LiGhTx117 I'd expect that to be unrelated to the operations, unless `x` is very large, because that's just a question of how it's stored in memory, right? – Dan Getz May 05 '16 at 14:47
    I'm curious, what about multiplying by 0.5 instead of dividing by 2? From previous experience with mips assembly programming, division normally results in a multiplication operation anyway. (That would explain the preference of bit shifting instead of division) – Sayse May 05 '16 at 14:58
    @Sayse that would convert it to floating point. Hopefully integer floor division would be quicker than a round trip through floating point. – Dan Getz May 05 '16 at 15:05
  • @DanGetz - Yes, I agree, hence the curiosity (also that experience was a long time ago working with floating points) – Sayse May 05 '16 at 15:06
    Note that running this test on python2.x doesn't seem to point at one way being consistently faster than the other (at least not for me -- and I've only tested `naive` and `mixed`). This leads me to believe that it doesn't have to do with the operations being faster at the C level since python3's integers are arbitrary precision... – mgilson May 05 '16 at 15:21
    Integer multiplication takes 1 instruction time in most modern ALUs. It shouldn't be slower that shift. So the actual question here is why is the shift operator slower. – xvan May 05 '16 at 15:22
    And, for people smarter than me: [rshift](https://hg.python.org/cpython/file/tip/Objects/longobject.c#l4195) and [multiply](https://hg.python.org/cpython/file/tip/Objects/longobject.c#l3472). It looks like the operations are done using some algorithms from Knuth, though I haven't looked at it _super_ closely. – mgilson May 05 '16 at 15:27
    @DanGetz I thought it might be related to deep operations inside the CPU. But this might be relevant only for a really large amount of tests. Your answer seems really complete and consistent. Thank you for your sharing. – LiGhTx117 May 05 '16 at 17:46
  • On my Surface 3, Windows 10, with Python 3, bit shift in either direction takes longer than multiplication or division by 2... and also with powers of 2 of any size. I have no idea why, but it's undeniable. It's measurable, consistent, and statistically significant. I wrote a simple script solely to test this with no other functionality just to see the trend. – CogitoErgoCogitoSum Dec 06 '17 at 18:18

1 Answer


This seems to be because multiplication of small numbers is optimized in CPython 3.5, in a way that left shifts by small numbers are not. Positive left shifts always create a larger integer object to store the result, as part of the calculation, while for multiplications of the sort you used in your test, a special optimization avoids this and creates an integer object of the correct size. This can be seen in the source code of Python's integer implementation.

Because integers in Python are arbitrary-precision, they are stored as arrays of integer "digits", with a limit on the number of bits per integer digit. So in the general case, operations involving integers are not single operations, but instead need to handle the case of multiple "digits". In pyport.h, this bit limit is defined as 30 bits on 64-bit platforms, or 15 bits otherwise. (I'll just call this 30 from here on to keep the explanation simple. But note that if you were using Python compiled for 32-bit, your benchmark's result would depend on whether x was less than 32,768 or not.)
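You can inspect the digit size from Python itself, and (as a CPython-specific check) watch the integer object grow by one digit once a value crosses the digit boundary:

```python
import sys

# Bits per internal integer "digit": 30 on typical 64-bit CPython builds.
bits = sys.int_info.bits_per_digit

one_digit = 2 ** bits - 1   # largest value that fits in a single digit
two_digits = 2 ** bits      # needs a second digit

print(bits)
print(sys.getsizeof(one_digit), sys.getsizeof(two_digits))
```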

When an operation's inputs and outputs stay within this 30-bit limit, the operation can be handled in an optimized way instead of the general way. The beginning of the integer multiplication implementation is as follows:

static PyObject *
long_mul(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
        /* if we don't have long long then we're almost certainly
           using 15-bit digits, so v will fit in a long.  In the
           unlikely event that we're using 30-bit digits on a platform
           without long long, a large v will just cause us to fall
           through to the general multiplication code below. */
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
#endif
    }

So when multiplying two integers where each fits in a 30-bit digit, this is done as a direct multiplication by the CPython interpreter, instead of working with the integers as arrays. (MEDIUM_VALUE() called on a positive integer object simply gets its first 30-bit digit.) If the result fits in a single 30-bit digit, PyLong_FromLongLong() will notice this in a relatively small number of operations, and create a single-digit integer object to store it.
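The effect of that fast path on object sizing can be observed from Python; this is a CPython-specific illustration, assuming 30-bit digits on a 64-bit build:

```python
import sys

# Both operands fit in one 30-bit digit; whether the *result* fits in one
# digit determines the size of the object the fast path creates.
fits = (2 ** 15) * (2 ** 14)    # 2**29: still a single digit
spills = (2 ** 15) * (2 ** 15)  # 2**30: needs a second digit

print(sys.getsizeof(fits), sys.getsizeof(spills))
```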

In contrast, left shifts are not optimized this way, and every left shift deals with the integer being shifted as an array. In particular, if you look at the source code for long_lshift(), in the case of a small but positive left shift, a 2-digit integer object is always created, if only to have its length truncated to 1 later: (my comments in /*** ***/)

static PyObject *
long_lshift(PyObject *v, PyObject *w)
{
    /*** ... ***/

    wordshift = shiftby / PyLong_SHIFT;   /*** zero for small w ***/
    remshift  = shiftby - wordshift * PyLong_SHIFT;   /*** w for small w ***/

    oldsize = Py_ABS(Py_SIZE(a));   /*** 1 for small v > 0 ***/
    newsize = oldsize + wordshift;
    if (remshift)
        ++newsize;   /*** here newsize becomes at least 2 for w > 0, v > 0 ***/
    z = _PyLong_New(newsize);

    /*** ... ***/
}
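The two code paths can be compared in isolation with a micro-benchmark; the timings are indicative and version-dependent (on CPython 3.5 the multiplication comes out ahead for small values), and the value of `x` here is an arbitrary choice:

```python
import timeit

x = 123456  # fits in a single 30-bit digit

t_mul = timeit.timeit('x * 2', globals={'x': x})
t_shl = timeit.timeit('x << 1', globals={'x': x})

# Same value either way; only the code path inside CPython differs.
assert x * 2 == (x << 1)
print('x * 2:  {:.3f}s'.format(t_mul))
print('x << 1: {:.3f}s'.format(t_shl))
```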

Integer division

You didn't ask about the worse performance of integer floor division compared to right shifts, because that fit your (and my) expectations. But dividing a small positive number by another small positive number is not as optimized as small multiplications, either. Every // computes both the quotient and the remainder using the function long_divrem(). This remainder is computed for a small divisor with a multiplication, and is stored in a newly-allocated integer object, which in this situation is immediately discarded.

Or at least, that was the case when this question was originally asked. In CPython 3.6, a fast path for small int // was added, so // now beats >> for small ints too.
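As a quick sanity check, the two operations are equivalent for nonnegative integers, and the relative timings can be compared the same way (on 3.6+ the `//` fast path applies, so results will differ from 3.5):

```python
import timeit

x = 123456  # an arbitrary small positive value

t_div = timeit.timeit('x // 2', globals={'x': x})
t_shr = timeit.timeit('x >> 1', globals={'x': x})

# Floor division by 2 and a right shift by 1 agree for nonnegative x.
assert x // 2 == (x >> 1)
print('x // 2: {:.3f}s'.format(t_div))
print('x >> 1: {:.3f}s'.format(t_shr))
```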

user2357112
Dan Getz
    It is an interesting observation with the division, thanks for pointing it out. It goes without saying that this is an excellent answer overall. – hilberts_drinking_problem May 06 '16 at 20:09
  • A well-researched and written answer to an excellent question. It might be interesting to show graphs for the timing for `x` outside the optimized range. – Barmar May 10 '16 at 18:36