It might be surprising, but summing up values in a matrix isn't an easy task. This and this answer of mine give some insights, so I won't repeat them in much detail here, but to get the best performance one must make the best possible use of the cache and of the SIMD units and instruction pipelining for floating-point operations on modern CPUs.
NumPy gets all of the above right and is quite hard to beat. Beating it is not impossible, but one must be really intimate with low-level optimization to succeed, and one should not expect much of an improvement. A naive implementation won't beat NumPy at all.
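Just to illustrate what "naive" means here: a pure-Python double loop (a minimal sketch, not benchmarked below) performs one interpreter-level addition per input element and stands no chance against NumPy's vectorized version:
def bin2d_naive(a, K):
    # one interpreter-level addition per input element - orders of magnitude slower
    m_bins, n_bins = a.shape[0]//K, a.shape[1]//K
    res = [[0.0]*n_bins for _ in range(m_bins)]
    for y in range(a.shape[0]):
        for x in range(a.shape[1]):
            res[y//K][x//K] += a[y, x]
    return res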
Here are my attempts using Cython and Numba, where all of the speed-up comes from parallelization.
Let's start with the baseline algorithm:
def bin2d(a, K):
    m_bins = a.shape[0]//K
    n_bins = a.shape[1]//K
    # fold every KxK block into its own pair of axes, then sum those axes away
    return a.reshape(m_bins, K, n_bins, K).sum(3).sum(1)
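To see what the reshape trick does, here is a tiny worked example (a sketch): for a 4x4 matrix and K=2, the reshape folds every 2x2 block into its own pair of axes, which are then summed away:
import numpy as np
t = np.arange(16, dtype=np.float64).reshape(4, 4)
print(bin2d(t, 2))
# [[10. 18.]
#  [42. 50.]]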
Now let's measure its performance:
import numpy as np
N,K=2000,50
a=np.arange(N*N, dtype=np.float64).reshape(N,N)
%timeit bin2d(a,K)
# 2.76 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Cython:
Here is my implementation with Cython, which uses OpenMP for parallelization. To keep things simple I perform the summation in place, so the passed array will be modified (which is not the case for NumPy's version):
%%cython -a -c=/openmp --link-args=/openmp -c=/arch:AVX512
import numpy as np
import cython
from cython.parallel import prange

cdef extern from *:
    """
    //assumes direct memory and row-major order (i.e. rows are contiguous)
    double calc_bin(double *ptr, int N, int y_offset){
        double *row = ptr;
        for(int y=1;y<N;y++){
            row+=y_offset; //next row
            //there are no dependencies, so the summation can be done in parallel
            for(int x=0;x<N;x++){
                ptr[x]+=row[x];
            }
        }
        double res=0.0;
        for(int x=0;x<N;x++){
            //could be made slightly faster (summation is not in parallel), but is it needed?
            res+=ptr[x];
        }
        return res;
    }
    """
    double calc_bin(double *ptr, int N, int y_offset) nogil

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_bin2d_parallel(double[:, ::1] a, int K):
    cdef int y_offset = a.shape[1]  # row stride, so non-square input works too
    cdef int m_bins = a.shape[0]//K
    cdef int n_bins = a.shape[1]//K
    cdef double[:,:] res = np.empty((m_bins, n_bins), dtype=np.float64)
    cdef int i,j,k
    for k in prange(m_bins*n_bins, nogil=True):
        i = k//n_bins  # bin row
        j = k%n_bins   # bin column
        res[i,j] = calc_bin(&a[i*K, j*K], K, y_offset)
    return res.base
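Because the summation happens in place, every call needs a fresh copy of the input; a quick sanity check against the NumPy baseline could look like this (a sketch):
b = np.arange(N*N, dtype=np.float64).reshape(N,N)
assert np.allclose(bin2d(b, K), cy_bin2d_parallel(b.copy(), K))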
And now the timing:
a=np.arange(N*N, dtype=np.float64).reshape(N,N)
%timeit cy_bin2d_parallel(a,K)
# 1.51 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# without parallelization: 2.93 ms
which is about 30% faster. Using -c=/arch:AVX512 (otherwise MSVC compiles only with /Ox), as proposed by @max9111, made the single-threaded version about 20% faster, but brought only about 5% for the parallel version.
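For completeness: the flags above are MSVC syntax. On Linux with gcc or clang the equivalent would presumably be -fopenmp plus an architecture switch such as -march=native (an untested assumption on my side):
%%cython -a -c=-fopenmp --link-args=-fopenmp -c=-march=native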
Numba:
Here is the same algorithm compiled with Numba (which can often beat Cython due to the better performance of its LLVM-based compiler). The result is slightly slower than Cython's, but it still beats NumPy by about 20%:
import numba as nb

@nb.njit(parallel=True)
def nb_bin2d_parallel(a, K):
    m_bins = a.shape[0]//K
    n_bins = a.shape[1]//K
    res = np.zeros((m_bins, n_bins), dtype=np.float64)
    for k in nb.prange(m_bins*n_bins):
        i = k//n_bins  # bin row
        j = k%n_bins   # bin column
        # accumulate all rows of the bin into its first row (in place):
        for y in range(i*K+1, (i+1)*K):
            for x in range(j*K, (j+1)*K):
                a[i*K, x] += a[y, x]
        # then sum up that first row:
        s = 0.0
        for x in range(j*K, (j+1)*K):
            s += a[i*K, x]
        res[i, j] = s
    return res
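As with the Cython version the input gets modified in place, and the first call also triggers the JIT compilation; a combined warm-up and sanity check (a sketch) keeps the compilation cost out of the timing:
b = np.arange(N*N, dtype=np.float64).reshape(N,N)
assert np.allclose(bin2d(b, K), nb_bin2d_parallel(b.copy(), K))  # also warms up the JIT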
and now:
a=np.arange(N*N, dtype=np.float64).reshape(N,N)
%timeit nb_bin2d_parallel(a,K)
# 1.98 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# without parallelization: 5.8 ms
In a nutshell: I guess it is possible to beat the above, but there is no free lunch anymore, as NumPy does its job pretty well. The most potential probably lies in the parallelization, but it is limited by the memory-bandwidth-bound nature of the problem (and one should probably use a smarter strategy than mine - for 50*50 bins one can possibly still see the overhead of thread creation/management).
Here is another, faster (at least for sizes around 2000) attempt with Cython, which performs the summation not for small 50-element parts but for a whole row, thus reducing the overhead (but possibly incurring more cache misses once a row is longer than 1-2k elements):
%%cython -a --verbose -c=/openmp --link-args=/openmp -c=/arch:AVX512
import numpy as np
import cython
from cython.parallel import prange

cdef extern from *:
    """
    //sums a band of N rows: first in place into the band's first row,
    //then bin-wise (N elements at a time) into out
    void calc_bin_row(double *ptr, int N, int y_offset, double* out){
        double *row = ptr;
        for(int y=1;y<N;y++){
            row+=y_offset; //next row
            for(int x=0;x<y_offset;x++){
                ptr[x]+=row[x];
            }
        }
        double res=0.0;
        int i=0;
        int k=0;
        for(int x=0;x<y_offset;x++){//could be made faster, but is it needed?
            res+=ptr[x];
            k++;
            if(k==N){
                k=0;
                out[i]=res;
                i++;
                res=0.0;
            }
        }
    }
    """
    void calc_bin_row(double *ptr, int N, int y_offset, double* out) nogil

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_bin2d_parallel_rowise(double[:, ::1] a, int K):
    cdef int y_offset = a.shape[1]  # row stride
    cdef int n_rows = a.shape[0]
    cdef int m_bins = a.shape[0]//K
    cdef int n_bins = a.shape[1]//K
    cdef double[:,:] res = np.empty((m_bins, n_bins), dtype=np.float64)
    cdef int k
    for k in prange(0, n_rows, K, nogil=True):
        # one task per band of K rows
        calc_bin_row(&a[k, 0], K, y_offset, &res[k//K, 0])
    return res.base
which is already faster single-threaded (2 ms) and about 60% faster with parallelization (1.27 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)).
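If the destructive behavior is not acceptable, a thin (hypothetical) wrapper that copies the input first restores NumPy's semantics, at the cost of one extra pass over memory:
def cy_bin2d_rowise_safe(a, K):
    # copy first, so the caller's array stays untouched
    return cy_bin2d_parallel_rowise(a.copy(), K)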