I am summing each element in a 1D array using either Cython or NumPy. When summing integers, Cython is ~20% faster. When summing floats, Cython is ~2.5x slower. The two simple functions I used are below.
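#cython: boundscheck=False
#cython: wraparound=False
# cimports needed for the typed ndarray buffer arguments below
cimport numpy as np
from numpy cimport ndarray

def sum_int(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
    # plain typed loop over the int64 buffer
    for i in range(n):
        total += a[i]
    return total

def sum_float(ndarray[np.float64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    # identical loop, only the element and accumulator types differ
    for i in range(n):
        total += a[i]
    return total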
Timings
Create two arrays of 1 million elements each:
a_int = np.random.randint(0, 100, 10**6)
a_float = np.random.rand(10**6)
%timeit sum_int(a_int)
394 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
490 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sum_float(a_float)
982 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_float.sum()
383 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
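For anyone reproducing the NumPy side of these timings outside IPython, here is a minimal sketch using the standard `timeit` module. The helper name `best_time_us` is hypothetical, and forcing the integer array to `int64` is my own assumption so that the input matches `np.int64_t` regardless of the platform default integer type. Note that it reports the best run, while `%timeit` reports mean ± std. dev., so the numbers are not directly comparable.

```python
import timeit

import numpy as np

# Same sizes as in the question. The explicit int64 cast is an assumption
# added here so the array matches np.int64_t on every platform.
a_int = np.random.randint(0, 100, 10**6).astype(np.int64)
a_float = np.random.rand(10**6)


def best_time_us(func, repeat=7, number=100):
    """Return the best per-call time in microseconds over `repeat` runs."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number * 1e6


if __name__ == "__main__":
    print(f"a_int.sum():   {best_time_us(a_int.sum):.1f} µs")
    print(f"a_float.sum(): {best_time_us(a_float.sum):.1f} µs")
```

The compiled `sum_int` and `sum_float` can be passed to the same helper, for example `best_time_us(lambda: sum_float(a_float))`, once the extension module is built.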
Additional points
- NumPy outperforms Cython by a large margin with floats (383 µs vs. 982 µs), and its float sum even beats its own integer sum (383 µs vs. 490 µs).
- The performance difference for sum_float is the same with the boundscheck and wraparound directives missing. Why?
- Converting the integer NumPy array in sum_int to a C pointer (np.int64_t *arr = <np.int64_t *> a.data) improves performance by an additional 25%. Doing so for the floats does nothing.
Main Question
How can I get the same performance in Cython with floats that I do with integers?
EDIT - Just Counting is Slow?!?
I wrote an even simpler function that just counts the number of iterations. The first stores the count as an int, the latter as a double.
def count_int():
    cdef:
        Py_ssize_t i, n = 1000000
        int ct=0
    for i in range(n):
        ct += 1
    return ct

def count_double():
    cdef:
        Py_ssize_t i, n = 1000000
        double ct=0
    for i in range(n):
        ct += 1
    return ct
Timings of counting
I ran these just once (afraid of caching). I have no idea whether the loop is actually being executed for the integer version, but count_double has the same performance as sum_float from above. This is crazy...
%timeit -n 1 -r 1 count_int()
1.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n 1 -r 1 count_double()
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
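One possible explanation is that a C compiler may collapse the integer loop into a closed form, but must keep floating-point additions in their written order unless flags such as `-ffast-math` are given, because floating-point addition is not associative. This is an assumption, not verified against the generated C or assembly. The small Python sketch below only illustrates that last property; it does not measure or prove anything about the Cython build.

```python
# Floating-point addition is not associative: regrouping the same three
# operands gives two different doubles.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False

# Integer addition has no such problem, so reordering or collapsing an
# integer accumulation cannot change its result.
assert (1 + 2) + 3 == 1 + (2 + 3)
```

If this guess is right, it would also be consistent with count_int finishing in about a microsecond while count_double takes as long as sum_float, but checking the compiler output would be needed to confirm it.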