Fast Numpy Loops

Question

How do you optimize this code without vectorizing it? (Vectorizing requires exploiting the semantics of the calculation, which is quite often far from trivial.)

slow_lib.py:
import numpy as np

def foo():
    size = 200
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size,size), dtype = np.float64)
    for i in range(0,size):
        for j in range(0,size):
            val = bar[j]
            moo += np.outer(val, val)

The point is that loops of this kind quite often correspond to operations where you have double sums over some vector operation.

This is quite slow:

>>t = timeit.timeit('foo()', 'from slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 41.165681839

OK, so then let's cythonize it and add type annotations like there is no tomorrow:

c_slow_lib.pyx:
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def foo():
    cdef int size = 200
    cdef int i,j
    np.random.seed(1000031212)
    cdef np.ndarray[np.double_t, ndim=2] bar = np.random.rand(size, size)
    cdef np.ndarray[np.double_t, ndim=2] moo = np.zeros((size,size), dtype = np.float64)
    cdef np.ndarray[np.double_t, ndim=1] val
    for i in xrange(0,size):
        for j in xrange(0,size):
            val = bar[j]
            moo += np.outer(val, val)


>>t = timeit.timeit('foo()', 'from c_slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 42.3104710579

... ehr... what? Numba to the rescue!

numba_slow_lib.py:
import numpy as np
from numba import jit

size = 200
np.random.seed(1000031212)

bar = np.random.rand(size, size)

@jit
def foo():
    bar = np.random.rand(size, size)
    moo = np.zeros((size,size), dtype = np.float64)
    for i in range(0,size):
        for j in range(0,size):
            val = bar[j]
            moo += np.outer(val, val)

>>t = timeit.timeit('foo()', 'from numba_slow_lib import foo', number = 10)
>>print("took: "+str(t))
took: 40.7327859402

So is there really no way to speed this up? The point is:

if I convert the inner loop into a vectorized version (building a larger matrix representing the inner loop and then calling np.outer on the larger matrix) I get much faster code.
if I implement something similar in Matlab (R2016a) this performs quite well due to JIT.

Neither Cython nor the JIT speeds this up, because you're already running C code (via np.outer). The problem here is actually the loop itself; you need to change its inner structure so that those tools can actually accelerate it. – pekapa, Jun 13 '16 at 15:24
I know that vectorizing the inner (or both) loops will accelerate the code significantly. My point is that the loop apparently creates some significant overhead that shouldn't be there. In other words: why is calling np.outer 200 times so much slower than calling np.outer once on a matrix with, say, 200 rows (vectorizing), whereas in a Matlab loop this is a non-issue? And how can that be overcome? – ndbd, Jun 13 '16 at 15:33
I don't think I can help any further, but have a look at this answer about how each (Python and Matlab) treats loops: http://stackoverflow.com/a/17242928/2752305 – pekapa, Jun 13 '16 at 15:42
Well, one thing is the function-call overhead of calling it 200 times. This slows things down at both the Python and MATLAB levels. The JIT has significantly improved this in recent times, though, and NumPy might need to catch up on that (I don't have much info on it). – Divakar, Jun 13 '16 at 15:58
Stop using `np.outer` in the Cython and NumPy versions. Use a manual loop, and you should get better performance. – user2357112, Jun 13 '16 at 16:09
Also, you're not calling `np.outer` 200 times. You're calling it 40000 times. – user2357112, Jun 13 '16 at 16:11
@user2357112 I was talking about vectorizing the inner loop, which already gives a significant performance boost in Python... – ndbd, Jul 07 '16 at 08:08

hpaulj · Accepted Answer · 2016-06-14T01:12:01.833

Here's the code for outer:

def outer(a, b, out=None):    
    a = asarray(a)
    b = asarray(b)
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis,:], out)

So each call to outer involves a number of python calls. Those eventually call compiled code to perform the multiplication. But each incurs an overhead that has nothing to do with the size of your arrays.

So 200**2 = 40,000 calls to outer will each incur that overhead, whereas one call that handles all 200 rows at once incurs it only once, followed by one fast compiled operation.
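A rough way to see this fixed per-call cost is to time np.outer against the bare broadcasting multiply it wraps, on a vector small enough that the arithmetic itself is negligible. This is only a sketch: the helper name `time_call` and the vector length are illustrative choices, and the absolute timings will depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

def time_call(stmt, env, number=20000):
    """Average seconds per execution of `stmt` (hypothetical helper)."""
    return timeit.timeit(stmt, globals=env, number=number) / number

val = np.random.rand(8)  # tiny vector: the multiplication itself is almost free
env = {"np": np, "val": val}

# np.outer goes through asarray/ravel/indexing before multiplying;
# the second statement performs only the broadcasting multiply.
t_outer = time_call("np.outer(val, val)", env)
t_bcast = time_call("val[:, None] * val", env)

# Both expressions compute the same 8x8 matrix.
same = np.allclose(np.outer(val, val), val[:, None] * val)

print("np.outer:     %.2f us per call" % (t_outer * 1e6))
print("broadcasting: %.2f us per call" % (t_bcast * 1e6))
```

Multiplying the per-call figure by 40,000 gives a feel for how much of the total run time is overhead rather than arithmetic.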

cython and numba don't compile or otherwise bypass the Python code in outer. All they can do is streamline the iteration code that you wrote - and that isn't consuming much time.

Without getting into details, the MATLAB jit must be able to replace the 'outer' with faster code - it rewrites the iteration. But my experience with MATLAB dates from a time before its jit.

For real speed improvements with cython and numba you need to use primitive numpy/python code all the way down. Or better yet focus your effort on slow inner pieces.

Replacing your outer with a streamlined version cuts run time about in half:

import numpy as np

def foo1(N):
    size = N
    np.random.seed(1000031212)
    bar = np.random.rand(size, size)
    moo = np.zeros((size,size), dtype = np.float64)
    for i in range(0,size):
        for j in range(0,size):
            val = bar[j]
            # broadcasting multiply: same result as np.outer(val, val)
            moo += val[:,None]*val
    return moo

With the full N=200 your function took 17s per loop. If I replace the inner two lines with pass (no calculation), time drops to 3ms per loop. In other words, the outer loop mechanism is not a big time consumer, at least not compared to many calls to outer().
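The comparison described above can be reproduced along the following lines. This is a sketch with illustrative function names (`empty_loops`, `outer_loops`) and a smaller size than the question's 200 so that it finishes quickly; the measured numbers will differ from the ones quoted here.

```python
import timeit
import numpy as np

def empty_loops(size):
    """The bare double loop with no work in the body."""
    for i in range(size):
        for j in range(size):
            pass

def outer_loops(bar):
    """The same double loop, accumulating one outer product per iteration."""
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += val[:, None] * val
    return moo

size = 50
bar = np.random.rand(size, size)

repeats = 5
t_empty = timeit.timeit(lambda: empty_loops(size), number=repeats) / repeats
t_work = timeit.timeit(lambda: outer_loops(bar), number=repeats) / repeats

print("empty loops: %.3f ms per run" % (t_empty * 1e3))
print("with work:   %.3f ms per run" % (t_work * 1e3))
```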

Divakar · Answer 2 · 2016-06-13T15:37:34.500

10

Memory permitting, you can use np.einsum to perform those heavy calculations in a vectorized manner, like so -

moo = size*np.einsum('ij,ik->jk',bar,bar)

One can also use np.tensordot -

moo = size*np.tensordot(bar,bar,axes=(0,0))

Or simply np.dot -

moo = size*bar.T.dot(bar)
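A quick way to convince yourself that these one-liners compute the same thing as the double loop is to compare them on a small input. The sketch below assumes a reference function `loop_version` (an illustrative name) that mirrors the question's loop but takes `bar` as an argument and returns `moo`, and it uses a small `size` so that it runs quickly.

```python
import numpy as np

def loop_version(bar):
    """Reference: the question's double loop, returning the accumulated matrix."""
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            val = bar[j]
            moo += np.outer(val, val)
    return moo

size = 20
rng = np.random.RandomState(0)
bar = rng.rand(size, size)

expected = loop_version(bar)

# The inner loop sums outer(bar[j], bar[j]) over j, which equals bar.T @ bar;
# the outer loop over i simply repeats that sum `size` times.
via_einsum = size * np.einsum('ij,ik->jk', bar, bar)
via_tensordot = size * np.tensordot(bar, bar, axes=(0, 0))
via_dot = size * bar.T.dot(bar)

print(np.allclose(expected, via_einsum),
      np.allclose(expected, via_tensordot),
      np.allclose(expected, via_dot))
```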

edited Jun 13 '16 at 15:37

answered Jun 13 '16 at 15:31


thx, appreciated, but I already know that vectorizing the code speeds up the computation. Sometimes it's easy to see how to vectorize the code (as done here with einsum), but sometimes one needs really great insight into the underlying problem, and it's much easier to write the code in loops. What to do then? – ndbd Jun 13 '16 at 15:36
@ndbd If you are asking how to speed up code in the generic case, I would say it depends. But from personal experience, I have found NumPy ufuncs and functions like `einsum` and the dot-product-based functions useful when dealing with multiplications and reductions; these are vectorized approaches at the Python level. For a generic case, I can't really say anything noteworthy I think, sorry! – Divakar Jun 13 '16 at 15:43

score 6 · Answer 3 · answered Jun 13 '16 at 18:13

Many tutorials and demonstrations of Cython, Numba, etc. make it seem as if these tools can speed up your code automagically, but in practice this is often not the case: you'll need to modify your code a little to extract the best performance. If you have already implemented some degree of vectorization, this usually means writing out ALL the loops. Reasons why NumPy array operations are non-optimal include:

Lots of temporary arrays are created and looped over;
Significant per-call overhead if the arrays are small;
Short-circuiting logic can't be implemented, because arrays are processed as a whole;
Sometimes the optimal algorithm can't be expressed using array expressions and you settle for an algorithm with a worse time complexity.

Using Numba or Cython won't optimize these problems away! Instead, these tools allow you to write loopy code that is much faster than plain Python.
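To make the first reason in the list above concrete: an expression such as `moo += val[:, None] * val` allocates a fresh temporary array for the product on every iteration. NumPy ufuncs accept an `out=` argument, so a preallocated scratch buffer can be reused instead. The following is a minimal sketch with illustrative names (`accumulate_outer`, `scratch`); it removes the repeated allocations but not the per-call overhead, so it is not a substitute for writing out the loops under Numba or Cython.

```python
import numpy as np

def accumulate_outer(bar):
    """Sum outer(bar[j], bar[j]) over all rows j, reusing one scratch buffer."""
    n_rows, n_cols = bar.shape
    moo = np.zeros((n_cols, n_cols))
    scratch = np.empty((n_cols, n_cols))  # allocated once, reused every iteration
    for j in range(n_rows):
        val = bar[j]
        # Write the outer product into `scratch` instead of a new temporary.
        np.multiply(val[:, None], val, out=scratch)
        moo += scratch
    return moo

bar = np.random.RandomState(1).rand(30, 30)
result = accumulate_outer(bar)

# The sum of row-wise outer products equals bar.T @ bar.
print(np.allclose(result, bar.T.dot(bar)))
```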

Also, for Numba specifically, you should be aware of the difference between "object mode" and "nopython mode". The tight loops from your example have to run in nopython mode to provide any significant speedup. However, numpy.outer is not yet supported by Numba (at the time of writing), so the function is compiled in object mode. Decorate with jit(nopython=True) to make such cases raise an exception instead.

Example to demonstrate a speedup is indeed possible:

import numpy as np
from numba import jit

@jit
def foo_nb(bar):
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(0,size):
        for j in range(0,size):
            val = bar[j]
            moo += np.outer(val, val)
    return moo

@jit
def foo_nb2(bar):
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            for k in range(0,size):
                for l in range(0,size):
                    moo[k,l] += bar[j,k] * bar[j,l]
    return moo

size = 100
bar = np.random.rand(size, size)

np.allclose(foo_nb(bar), foo_nb2(bar))
# True

%timeit foo_nb(bar)
# 1 loop, best of 3: 816 ms per loop
%timeit foo_nb2(bar)
# 10 loops, best of 3: 176 ms per loop

score -2 · Answer 4 · answered Jun 13 '16 at 22:48

The example you show is a rather inefficient algorithm, since it calculates the same outer product multiple times. The resulting time complexity is O(n^4). It can be reduced to O(n^3):

for i in range(0,size):
    val = bar[i]
    moo += size * np.outer(val, val)
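As a sanity check, the reduced loop can be compared against the original double loop on a small input. The function names `quartic` and `cubic` below are illustrative and do not come from the question.

```python
import numpy as np

def quartic(bar):
    """Original double loop: size**2 outer products, O(n^4) work."""
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            moo += np.outer(bar[j], bar[j])
    return moo

def cubic(bar):
    """Each outer product computed once and scaled by size: O(n^3) work."""
    size = bar.shape[0]
    moo = np.zeros((size, size))
    for i in range(size):
        val = bar[i]
        moo += size * np.outer(val, val)
    return moo

bar = np.random.RandomState(2).rand(15, 15)
print(np.allclose(quartic(bar), cubic(bar)))
```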
