Large Performance difference when summing ints vs floats in Cython vs NumPy

Question

I am summing each element in a 1D array using either Cython or NumPy. When summing integers Cython is ~20% faster. When summing floats, Cython is ~2.5x slower. Below are the two simple functions used.

#cython: boundscheck=False
#cython: wraparound=False

def sum_int(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0

    for i in range(n):
        total += a[i]
    return total 

def sum_float(ndarray[np.float64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0

    for i in range(n):
        total += a[i]
    return total

Timings

Create two arrays of 1 million elements each:

a_int = np.random.randint(0, 100, 10**6)
a_float = np.random.rand(10**6)

%timeit sum_int(a_int)
394 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit a_int.sum()
490 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit sum_float(a_float)
982 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit a_float.sum()
383 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Additional points

NumPy is outperforming (by quite a large margin) with floats and even beats its own integer sum.
The performance difference for sum_float is the same with the boundscheck and wraparound directives missing. Why?
Converting the integer numpy array in sum_int to a C pointer (np.int64_t *arr = <np.int64_t *> a.data) improves performance by an additional 25%. Doing so for the floats does nothing

Main Question

How can I get the same performance in Cython with floats that I do with integers?

EDIT - Just Counting is Slow?!?

I wrote an even simpler function that just counts the number of iterations. The first stores the count as an int, the latter as a double.

def count_int():
    cdef:
        Py_ssize_t i, n = 1000000
        int ct=0

    for i in range(n):
        ct += 1
    return ct

def count_double():
    cdef:
        Py_ssize_t i, n = 1000000
        double ct=0

    for i in range(n):
        ct += 1
    return ct

Timings of counting

I ran these just once (afraid of caching). No idea if the loop is actually being executed for the integer, but count_double has the same performance as the sum_float from above. This is crazy...

%timeit -n 1 -r 1 count_int()
1.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -n 1 -r 1 count_double()
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Do you know how the extract the relevant part of the Cython-generated C code from all the boilerplate? — abarnert, Mar 20 '18 at 16:15
Compiler probably optimizes for count_int (it is a multiplication in the end) but not for double because + isn’t associative for floating point arithmetics — ead, Mar 20 '18 at 19:02

score 15 · Accepted Answer · edited Mar 21 '18 at 10:31

I'm not going to answer all your questions, but only (in my eyes) the most interesting ones.

Let start with your counting example:

the compiler is able to optimize the for-loop in the integer-case - the resulting binary doesn't not calculate anything - it only hast to return the value precalculated during the compilation phase.
This is not the case for the double-case, because due to rounding errors, the result will not be 1.0*10**6 and because cython compiles in IEEE 754 (not -ffast-math) mode per default.

This you must keep it mind when looking at your cython-code: the compiler is not allowed to rearrange the summations (IEEE 754) and because the result of the first summations is needed for the next there is only one long line in which all operations wait.

But the most crucial insight: the numpy is not doing the same as your cython code:

>>> sum_float(a_float)-a_float.sum()
2.9103830456733704e-08

Yep, nobody told numpy (differently from your cython code) that the sum has to be calculated like this

((((a_1+a2)+a3)+a4)+...

And numpy take advantage of it in two ways:

it performs pairwise summation (kind of), which leads to a smaller rounding error.
it calculates sum in chunks (the code of python is somewhat hard to understand, here is the corresponding template and further down the listing of the used function pairwise_sum_DOUBLE)

The second point is the reason for the speed-up your are observing, the calculation happens similar to the following schema (at least what I understand from the source code below):

a1  + a9 + .....  = r1 
a2  + a10 + ..... = r2
..
a8  + a16 +       = r8

----> sum=r1+....+r8

The advantage of this kind of summation: the result of a2+a10 doesn't depend on a1+a9 and these both values can be calculated simultaneously (e.g. pipelining) on modern CPUs, which leads to the speed-up you are observing.

For what it is worth, on my machine the cython-integer-sum is slower than numpy's.

The need of taking the stride of the numpy-array into account (which is only known at the run-time, see also this question about vectorization) prevents some optimizations. A workaround is to use memory-views, for which you can make clear, that data is continuous, i.e.:

def sum_int_cont(np.int64_t[::1] a):

Which leads on my machine to a significant speed-up (factor 2):

%timeit sum_int(a_int)
2.64 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum_int_cont(a_int)
1.31 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit a_int.sum()
2.1 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's true, that in this case using memory-views for doubles doesn't bring any speed-up (don't know why), but in general it simplifies the life of the optimizer. For example combining the memory-view-variant with -ffast-math compile options, which would allow the associativity, leads to a performance comparable with numpy:

%%cython -c=-ffast-math
cimport numpy as np
def sum_float_cont(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0

    for i in range(n):
        total += a[i]
    return total

And now:

>>> %timeit sum_float(a_float)
3.46 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit sum_float_cont(a_float)
1.87 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_float.sum()
1.41 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Listing of pairwise_sum_DOUBLE:

/*
 * Pairwise summation, rounding error O(lg n) instead of O(n).
 * The recursion depth is O(lg n) as well.
 * when updating also update similar complex floats summation
 */
static npy_double
pairwise_sum_DOUBLE(npy_double *a, npy_uintp n, npy_intp stride)
{
    if (n < 8) {
        npy_intp i;
        npy_double res = 0.;
        for (i = 0; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else if (n <= PW_BLOCKSIZE) {
        npy_intp i;
        npy_double r[8], res;

        /*
         * sum a block with 8 accumulators
         * 8 times unroll reduces blocksize to 16 and allows vectorization with
         * avx without changing summation ordering
         */
        r[0] = (a[0 * stride]);
        r[1] = (a[1 * stride]);
        r[2] = (a[2 * stride]);
        r[3] = (a[3 * stride]);
        r[4] = (a[4 * stride]);
        r[5] = (a[5 * stride]);
        r[6] = (a[6 * stride]);
        r[7] = (a[7 * stride]);

        for (i = 8; i < n - (n % 8); i += 8) {
            r[0] += (a[(i + 0) * stride]);
            r[1] += (a[(i + 1) * stride]);
            r[2] += (a[(i + 2) * stride]);
            r[3] += (a[(i + 3) * stride]);
            r[4] += (a[(i + 4) * stride]);
            r[5] += (a[(i + 5) * stride]);
            r[6] += (a[(i + 6) * stride]);
            r[7] += (a[(i + 7) * stride]);
        }

        /* accumulate now to avoid stack spills for single peel loop */
        res = ((r[0] + r[1]) + (r[2] + r[3])) +
              ((r[4] + r[5]) + (r[6] + r[7]));

        /* do non multiple of 8 rest */
        for (; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else {
        /* divide by two but avoid non-multiples of unroll factor */
        npy_uintp n2 = n / 2;
        n2 -= n2 % 8;
        return pairwise_sum_DOUBLE(a, n2, stride) +
               pairwise_sum_DOUBLE(a + n2 * stride, n - n2, stride);
    }
}