Summary:
You guys are too awesome... I got my real code working. I took JoshAdel's advice, namely the following:
1) Changed all ndarrays to typed memoryviews
2) Unrolled all the NumPy array calculations manually
3) Used statically defined unsigned ints for the indices
4) Disabled boundscheck and wraparound
And also, thanks a lot to Veedrac's insight!
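In case it helps anyone else, here is roughly what those four points look like applied to the func1 example from the original post below. This is a sketch, not my exact final code, and the function name is made up:

# Sketch of JoshAdel's advice applied to func1 (not my exact final code)
import numpy as np
cimport cython

@cython.boundscheck(False)   # 4) disable bounds checking
@cython.wraparound(False)    # 4) disable negative-index wraparound
def func1_memview():
    # 1) typed memoryviews instead of np.ndarray buffers
    cdef double[:, ::1] array1 = np.random.rand(50000, 4)
    cdef double[::1] array2 = np.random.rand(50000)
    # 3) statically typed unsigned int indices
    cdef unsigned int i, j
    for i in range(1000):
        # 2) the NumPy slice arithmetic unrolled into an explicit element loop
        for j in range(array2.shape[0]):
            array1[j, 0] += array2[j]
            array1[j, 1] += array2[j]
            array1[j, 2] += array2[j]
            array1[j, 3] += array2[j]
    return

With changes like these, the inner loop compiles down to plain C array indexing instead of allocating NumPy temporaries for every slice operation.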
Original post:
I know that Python runs this kind of code really slowly:
import numpy as np
def func0():
    x = 0.
    for i in range(1000):
        x += 1.
    return
And if I change this to Cython, it can be much faster:
import numpy as np
cimport numpy as np
def func0():
    cdef double x = 0.
    for i in range(1000):
        x += 1.
    return
And here is the result:
# Python
10000 loops, best of 3: 58.9 µs per loop
# Cython
10000000 loops, best of 3: 66.8 ns per loop
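For reference, the Cython version has to be compiled into an extension module before it can be timed. Here is a minimal build script sketch, assuming the Cython code is saved as func0_cy.pyx (the file and module names are just for illustration):

# setup.py -- minimal build sketch; the name func0_cy is made up
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "func0_cy",
    ["func0_cy.pyx"],
    include_dirs=[np.get_include()],  # needed because the .pyx cimports numpy
)
setup(ext_modules=cythonize([ext]))

Then python setup.py build_ext --inplace builds the module; the timings above are in the output format of IPython's %timeit.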
However, now I have this kind of code, where the loop is not over single numbers but over arrays. (Yes... I am solving a PDE, so this comes up.)
I know the following example is silly, but it should give you an idea of the type of calculation:
Pure python:
import numpy as np
def func1():
    array1 = np.random.rand(50000, 4)
    array2 = np.random.rand(50000)
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return
Cython:
import numpy as np
cimport numpy as np
def func1():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return
And there is almost no improvement at all. I do know that pure Python handles these huge loops poorly because of the interpreter overhead.
# Python
1 loops, best of 3: 299 ms per loop
# Cython
1 loops, best of 3: 300 ms per loop
Any suggestions on how I should improve this kind of code?