
While optimising my code I realised the following:

>>> from timeit import Timer as T
>>> T(lambda : 1234567890 / 4.0).repeat()
[0.22256922721862793, 0.20560789108276367, 0.20530295372009277]
>>> from __future__ import division
>>> T(lambda : 1234567890 / 4).repeat()
[0.14969301223754883, 0.14155197143554688, 0.14141488075256348]
>>> T(lambda : 1234567890 * 0.25).repeat()
[0.13619112968444824, 0.1281130313873291, 0.12830305099487305]

and also:

>>> from math import sqrt
>>> T(lambda : sqrt(1234567890)).repeat()
[0.2597470283508301, 0.2498021125793457, 0.24994492530822754]
>>> T(lambda : 1234567890 ** 0.5).repeat()
[0.15409398078918457, 0.14059877395629883, 0.14049601554870605]

I assume it has to do with the way Python is implemented in C, but I wonder if anybody would care to explain why this is so?

thefourtheye
mac
  • The answer you accepted for your question (which I assume answers your real question) does not have much to do with your question title. Could you edit it to have something to do with constant folding? – Zan Lynx Nov 10 '11 at 01:37
    @ZanLynx - Hi. Would you mind clarifying? I find that the question title expresses precisely what I wanted to know (why X is faster than Y) and that the answer I selected does precisely that... Seems a perfect match to me... but maybe I am overlooking something? – mac Nov 10 '11 at 06:50
    Multiplication and power functions are always faster than division and sqrt() functions because of their very nature. Division and root operations generally have to use a series of finer and finer approximations and cannot go directly to the right answer like multiplication can. – Zan Lynx Nov 10 '11 at 17:34
  • I feel like the question title should say something about the fact that the values are all literal constants, which turns out to be key to the answer. On typical hardware, integer and FP multiply and add/subtract are cheap; integer and FP div, and FP sqrt, are all expensive (maybe 3x worse latency, and 10x worse throughput than FP mul). (Most CPUs implement these operations in hardware as single asm instructions, unlike cube-root or pow() or whatever). – Peter Cordes Sep 22 '16 at 22:02
    But I wouldn't be surprised if Python interpreter overhead still dwarfed the difference between mul and div asm instructions. Fun fact: on x86, FP division is usually higher performance than integer division. (http://agner.org/optimize/). A 64-bit integer divide on Intel Skylake has a latency of 42-95 cycles, vs. 26 cycles for 32-bit integer, vs. 14 cycles for double-precision FP. (64-bit integer multiply is 3 cycle latency, FP mul is 4). The throughput differences are even larger (int/FP mul and add are all at least one per clock, but division and sqrt aren't fully pipelined.) – Peter Cordes Sep 22 '16 at 22:07

1 Answer


The (somewhat unexpected) reason for your results is that Python seems to fold constant expressions involving floating-point multiplication and exponentiation, but not division. math.sqrt() is a different beast altogether since there's no bytecode for it and it involves a function call.

On Python 2.6.5, the following code:

x1 = 1234567890.0 / 4.0
x2 = 1234567890.0 * 0.25
x3 = 1234567890.0 ** 0.5
x4 = math.sqrt(1234567890.0)

compiles to the following bytecodes:

  # x1 = 1234567890.0 / 4.0
  4           0 LOAD_CONST               1 (1234567890.0)
              3 LOAD_CONST               2 (4.0)
              6 BINARY_DIVIDE       
              7 STORE_FAST               0 (x1)

  # x2 = 1234567890.0 * 0.25
  5          10 LOAD_CONST               5 (308641972.5)
             13 STORE_FAST               1 (x2)

  # x3 = 1234567890.0 ** 0.5
  6          16 LOAD_CONST               6 (35136.418286444619)
             19 STORE_FAST               2 (x3)

  # x4 = math.sqrt(1234567890.0)
  7          22 LOAD_GLOBAL              0 (math)
             25 LOAD_ATTR                1 (sqrt)
             28 LOAD_CONST               1 (1234567890.0)
             31 CALL_FUNCTION            1
             34 STORE_FAST               3 (x4)

As you can see, multiplication and exponentiation take no time at all since they're done when the code is compiled. Division takes longer since it happens at runtime. Square root is not only the most computationally expensive operation of the four, it also incurs various overheads that the others do not (attribute lookup, function call etc).
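You can check for constant folding yourself with the `dis` module. A minimal sketch (note that on Python 3 the optimizer folds constant division as well, so the bytecode will differ from the Python 2.6 output above):

```python
import dis

# Compile a simple assignment and inspect its bytecode. On Python 3 the
# peephole/AST optimizer folds the constant division, so only a single
# LOAD_CONST of the precomputed result remains.
code = compile("x = 1234567890.0 / 4.0", "<example>", "exec")
dis.dis(code)

# The folded result is stored directly in the code object's constants:
print(code.co_consts)  # includes 308641972.5
```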

If you eliminate the effect of constant folding, there's little to separate multiplication and division:

In [16]: x = 1234567890.0

In [17]: %timeit x / 4.0
10000000 loops, best of 3: 87.8 ns per loop

In [18]: %timeit x * 0.25
10000000 loops, best of 3: 91.6 ns per loop

math.sqrt(x) is actually a little bit faster than x ** 0.5, presumably because it's a special case of the latter and can therefore be done more efficiently, in spite of the overheads:

In [19]: %timeit x ** 0.5
1000000 loops, best of 3: 211 ns per loop

In [20]: %timeit math.sqrt(x)
10000000 loops, best of 3: 181 ns per loop
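If you want to reproduce these measurements outside IPython, here is a minimal `timeit` sketch. The absolute numbers are machine- and version-dependent; only the relative ordering is of interest:

```python
import math
import timeit

x = 1234567890.0

# Time both square-root spellings on a runtime value, so no constant
# folding can occur. Both compute the same result; only the overheads
# (attribute lookup, function call, generic pow dispatch) differ.
for label, stmt in [("x ** 0.5", "x ** 0.5"),
                    ("math.sqrt(x)", "math.sqrt(x)")]:
    t = timeit.timeit(stmt, globals={"x": x, "math": math}, number=1_000_000)
    print(f"{label}: {t:.3f} s per 1,000,000 calls")
```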

edit 2011-11-16: Constant expression folding is done by Python's peephole optimizer. The source code (peephole.c) contains the following comment that explains why constant division isn't folded:

    case BINARY_DIVIDE:
        /* Cannot fold this operation statically since
           the result can depend on the run-time presence
           of the -Qnew flag */
        return 0;

The -Qnew flag enables "true division" defined in PEP 238.
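To illustrate the ambiguity that blocked folding: under PEP 238, `7 / 2` means floor division in plain Python 2 but true division with `-Qnew` or the `__future__` import, so the compiler could not know the answer statically. In Python 3 the meanings are fixed:

```python
# In Python 3, `/` is always true division and `//` is always floor
# division, so a constant division has a single, statically knowable
# result -- which is why Python 3 folds it.
print(7 / 2)     # 3.5  (true division)
print(7 // 2)    # 3    (floor division)
print(7.0 // 2)  # 3.0  (floor division preserves the float type)
```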

NPE
    Perhaps it is "protecting" against divide-by-zero? – hugomg Nov 09 '11 at 16:38
    @missingno: It's unclear to me why any such "protection" would be necessary since both arguments are known at compile time, and so is the result (which is one of: +inf, -inf, NaN). – NPE Nov 09 '11 at 16:44
  • Maybe the `from __future__ import division` test uses a similar method. – Simon Bergot Nov 09 '11 at 17:13
    Constant folding works with `/` in Python 3, and with `//` in Python 2 and 3. So most likely this is a result of the fact that `/` can have different meanings in Python 2. Maybe when constant folding is done, it isn't known yet whether `from __future__ import division` is in effect? – interjay Nov 09 '11 at 18:00
    @aix - `1./0.` in Python 2.7 does not result in `NaN` but a `ZeroDivisionError`. – detly Nov 10 '11 at 02:02
  • @detly: Good point, thanks! I don't think this invalidates anything said thus far, except my remark about infinity and `NaN`s. – NPE Nov 10 '11 at 08:13
  • _since they're done when the code is **compiled**_ — Python is interpreted, isn't it? – Caridorc Jan 23 '15 at 13:27
    @Caridorc : Python is compiled into bytecode (.pyc files), which is then interpreted by the Python run time. Bytecode is not the same as Assembly/Machine Code (which is what a C compiler generates for instance). The dis module can be used to examine the bytecode that a given code fragment compiles to. – Tony Suffolk 66 Apr 15 '15 at 16:22