
Maybe I'm doing something odd, but I've found a surprising performance loss when using numpy, and it seems consistent regardless of the power used. For instance, when x is a random 100x100 array,

x = numpy.power(x,3) 

is about 60x slower than

x = x*x*x

A plot of the speed-up for various array sizes reveals a sweet spot with arrays around 10k elements, and a consistent 5-10x speed-up for other sizes.

[Plot: speed-up of x*x*x over numpy.power(x, 3) vs. array size]

Code to test on your own machine below (a little messy):

import numpy as np
from matplotlib import pyplot as plt
from time import time

ratios = []
sizes = []
for n in np.logspace(1, 3, 20).astype(int):
    a = np.random.randn(n, n)

    # Time repeated inline multiplication
    inline_times = []
    for i in range(100):
        t = time()
        b = a*a*a
        inline_times.append(time() - t)
    inline_time = np.mean(inline_times)

    # Time np.power with the same exponent
    pow_times = []
    for i in range(100):
        t = time()
        b = np.power(a, 3)
        pow_times.append(time() - t)
    pow_time = np.mean(pow_times)

    sizes.append(a.size)
    ratios.append(pow_time / inline_time)

plt.plot(sizes, ratios)
plt.title('Performance of inline vs numpy.power')
plt.ylabel('Nx speed-up using inline')
plt.xlabel('Array size')
plt.xscale('log')
plt.show()

Anyone have an explanation?

Newmu
  • As a side note, you really, really shouldn't be using `time` for timing code. The docs for `time` even tell you this. On some platforms, in some circumstances, it may be good enough… but it's still better to use `timeit`. – abarnert Aug 12 '14 at 00:46
  • In general, a generic power method, accepting arbitrary values for both the base and the exponent (especially non-integers) will require more complexity to work. A simple `x*x*x` however can be expressed very easily using opcodes, which themselves can be easily translated into actual machine code. – poke Aug 12 '14 at 00:49
  • @MikeGraham: I don't understand your response. `timeit` is how you profile your code; it's actually easier to use than `time`, as well as more likely to be right. (Even more so if you use IPython and its magic `%timeit`.) – abarnert Aug 12 '14 at 01:24
  • @MikeGraham: Were you responding to the comment by someone else (who since deleted it) suggesting a wrapper function around `power` that unrolls small integers, and just put my name instead of whoever's that was? If so, then yeah, you may be right. – abarnert Aug 12 '14 at 01:25
  • @abarnert, I meant to respond to Newmu's comment on my post...I don't know how I got so mixed up. – Mike Graham Aug 12 '14 at 03:41
  • @MikeGraham: You @'d me instead of Newmu again when moving the comment. I don't mind, but it may mean he doesn't see your reply. – abarnert Aug 12 '14 at 05:13
  • What time does good old `x**3` take? Isn't `x*x*x` superfluous? – Vladimir F Героям слава Aug 12 '14 at 06:35
  • [See also](http://stackoverflow.com/questions/1019740/speed-of-calculating-powers-in-python) – ali_m Aug 12 '14 at 08:47
  • Note that `x * x ** 2` seems to be even faster than `x * x * x` to me, 'cause of the specialisation for squaring. – Veedrac Aug 12 '14 at 08:49
  • @abarnert, I distinctly remember changing it, too. I think I'm going crazy. Thanks! – Mike Graham Aug 12 '14 at 11:42
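A quick check of the variants raised in these comments, using `timeit` as abarnert suggests (a sketch; absolute numbers will vary by machine and NumPy build):

import timeit

setup = 'import numpy as np; x = np.random.randn(100, 100)'
for stmt in ['x * x * x', 'x ** 3', 'x * x ** 2']:
    # Take the minimum over several repeats to suppress scheduling noise
    t = min(timeit.repeat(stmt, setup=setup, number=1000, repeat=5))
    print('{:12s} {:.4f} s per 1000 runs'.format(stmt, t))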

3 Answers


It's well known that multiplication of doubles, which your processor does in hardware, is very fast. `pow`, a library routine that must handle arbitrary exponents, is decidedly slower.

Some performance guides even advise people to plan around this, sometimes to an overzealous degree.

numpy special-cases squaring to make sure it's not too slow, but it sends cubing straight to your libc's `pow`, which isn't nearly as fast as a couple of multiplications.
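If you want to see the special-casing on your own machine, here's a sketch (exact numbers depend on your libc and NumPy version):

import timeit

setup = 'import numpy as np; a = np.random.randn(100, 100)'
for stmt in ['a * a', 'np.power(a, 2)', 'a * a * a', 'np.power(a, 3)']:
    t = min(timeit.repeat(stmt, setup=setup, number=1000, repeat=5))
    print('{:16s} {:.4f}'.format(stmt, t))

If squaring is special-cased as described, `np.power(a, 2)` should land close to `a * a`, while `np.power(a, 3)` should be far slower than `a * a * a`.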

Mike Graham
  • Thanks for the feedback. I don't know much about numpy internals, but with the performance gap being so large (and it explaining why numpy.square exists), do you think it would be a good thing to add cases to numpy.power for when the second operand is an integer, or a warning in the docs? – Newmu Aug 12 '14 at 01:01
  • @Newmu, Possibly so, not least because x*x*x probably isn't the fastest implementation of cubing x. That being said, this operation probably isn't the bottleneck in much if any code out there in need of optimizing. – Mike Graham Aug 12 '14 at 11:41

I suspect the issue is that np.power always does float exponentiation, and it doesn't know how to optimize or vectorize that on your platform (or, probably, most/all platforms), while multiplication is easy to toss into SSE, and pretty fast even without it.

Even if np.power were smart enough to do integer exponentiation separately, unless it unrolled small values into repeated multiplication, it still wouldn't be nearly as fast.
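For illustration, here's a hypothetical unrolling helper (not something NumPy provides) that reduces an integer exponent to O(log n) elementwise multiplications via exponentiation by squaring:

import numpy as np

def int_power(a, n):
    # Hypothetical: elementwise a**n for integer n >= 0,
    # via exponentiation by squaring.
    if n == 0:
        return np.ones_like(a)
    result = None
    square = np.array(a, copy=True)
    while n > 0:
        if n & 1:
            # Fold the current power-of-two factor into the result
            result = square.copy() if result is None else result * square
        n >>= 1
        if n:
            square = square * square
    return result

For n=3 this costs two multiplications and a copy, in the same ballpark as x*x*x.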

You can verify the float-exponentiation hypothesis pretty easily by comparing the times for int-to-int, int-to-float, float-to-int, and float-to-float powers against multiplication for a small array. Int-to-int is about 5x as fast as the others, but still 4x slower than multiplication. (I tested with PyPy and a customized NumPy, so it would be better for someone with the normal NumPy installed on CPython to give real results…)
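A rough version of that check with stock NumPy (a sketch; dtype combinations cover all four base/exponent cases):

import timeit

setup = '''
import numpy as np
ai = np.arange(1, 10001, dtype=np.int64)
af = ai.astype(np.float64)
'''
for stmt in ['ai ** 3', 'ai ** 3.0', 'af ** 3', 'af ** 3.0', 'af * af * af']:
    t = min(timeit.repeat(stmt, setup=setup, number=1000, repeat=5))
    print('{:14s} {:.4f}'.format(stmt, t))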

abarnert

The performance of numpy's power function scales very non-linearly with the exponent; contrast this with the naive approach, whose cost grows linearly with the exponent. The same type of scaling should exist regardless of array size. Basically, unless the exponent is sufficiently large, you aren't going to see any tangible benefit from np.power.

import matplotlib.pyplot as plt
import numpy as np
import functools
import time

def timeit(func):
    # Decorator returning (result, elapsed seconds) for each call
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        startTime = time.time()
        res = func(*args, **kwargs)
        elapsedTime = time.time() - startTime
        return (res, elapsedTime)
    return newfunc

@timeit
def naive_power(m, n):
    # Elementwise m**n (n >= 1) by repeated in-place multiplication
    m = np.asarray(m)
    res = m.copy()
    for i in range(1, n):
        res *= m
    return res

@timeit
def fast_power(m, n):
    # Elementwise power
    return np.power(m, n)

m = np.random.random((100, 100))
n = 400

ts1 = []
ts2 = []
for i in range(1, n):
    r1, t1 = naive_power(m, i)
    ts1.append(t1)

for i in range(1, n):
    r2, t2 = fast_power(m, i)
    ts2.append(t2)

plt.plot(ts1, label='naive')
plt.plot(ts2, label='numpy')
plt.xlabel('exponent')
plt.ylabel('time')
plt.legend(loc='upper left')
plt.show()

[Plot: per-call time vs. exponent for naive repeated multiplication and numpy.power]

Andrew Walker
  • `naive_power` seems to have a couple bugs. – Mike Graham Aug 12 '14 at 01:15
  • Also, to nitpick, this is about elementwise power, not "matrix power", which is really a different operation. – Mike Graham Aug 12 '14 at 01:15
  • @Andrew: just want to pass on a warning that the [timeit decorator you are using is unreliable](http://stackoverflow.com/q/1622943/190597). You need to make the function you are timing run longer so the time spent multiplying dominates the time being measured and to suppress things like OS process-scheduling flukes which may be causing of the jags. The timeit module does all this for you (as well as turn off garbage collection.) Here is [a version of your code](https://gist.github.com/unutbu/f587b4b78f7a33dbf824) using `timeit.timeit`. – unutbu Sep 16 '14 at 14:29
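Following that suggestion, a minimal `timeit`-based re-timing of the two approaches for one fixed exponent (a sketch, not the gist linked above):

import timeit

setup = 'import numpy as np; m = np.random.random((100, 100))'
naive = '''
res = m.copy()
for _ in range(9):   # nine in-place multiplies => m**10
    res *= m
'''
t_naive = min(timeit.repeat(naive, setup=setup, number=100, repeat=5))
t_power = min(timeit.repeat('np.power(m, 10)', setup=setup, number=100, repeat=5))
print('naive: {:.4f} s  np.power: {:.4f} s'.format(t_naive, t_power))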