How do you optimize this code (without vectorizing, as this leads up to using the semantics of the calculation, which is quite often far from being non-trivial):
slow_lib.py:
import numpy as np
def foo():
size = 200
np.random.seed(1000031212)
bar = np.random.rand(size, size)
moo = np.zeros((size,size), dtype = np.float)
for i in range(0,size):
for j in range(0,size):
val = bar[j]
moo += np.outer(val, val)
The point is that such kind loops correspond quite often to operations where you have double sums over some vector operation.
This is quite slow:
>>t = timeit.timeit('foo()', 'from slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 41.165681839
Ok, so then let's cynothize it and add type annotations likes there is no tomorrow:
c_slow_lib.pyx:
import numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def foo():
cdef int size = 200
cdef int i,j
np.random.seed(1000031212)
cdef np.ndarray[np.double_t, ndim=2] bar = np.random.rand(size, size)
cdef np.ndarray[np.double_t, ndim=2] moo = np.zeros((size,size), dtype = np.float)
cdef np.ndarray[np.double_t, ndim=1] val
for i in xrange(0,size):
for j in xrange(0,size):
val = bar[j]
moo += np.outer(val, val)
>>t = timeit.timeit('foo()', 'from c_slow_lib import foo', number = 10)
>>print ("took: "+str(t))
took: 42.3104710579
... ehr... what? Numba to the rescue!
numba_slow_lib.py:
import numpy as np
from numba import jit
size = 200
np.random.seed(1000031212)
bar = np.random.rand(size, size)
@jit
def foo():
bar = np.random.rand(size, size)
moo = np.zeros((size,size), dtype = np.float)
for i in range(0,size):
for j in range(0,size):
val = bar[j]
moo += np.outer(val, val)
>>t = timeit.timeit('foo()', 'from numba_slow_lib import foo', number = 10)
>>print("took: "+str(t))
took: 40.7327859402
So is there really no way to speed this up? The point is:
- if I convert the inner loop into a vectorized version (building a larger matrix representing the inner loop and then calling np.outer on the larger matrix) I get much faster code.
- if I implement something similar in Matlab (R2016a) this performs quite well due to JIT.