
I'm looking for a fast way to bin a 2D array. I saw the topic "bin a 2d array in numpy" on this site and found the following numpy solution, which makes the processing fast.

Edit: Binning works like this: if you have

array([[  0,   0,   0,   0,   0,   0],
       [  0, 144,   0,   0,   0,   0],
       [  0,   0,   0, 144,   3,   0],
       [109, 112, 116, 121,  40,  91]])

the output of binning is going to be

array([[144,   0,   0],
       [221, 381, 134]])

As you can see, each element of the output array is the sum of a 2x2 block of the original array in this case. The bins are about 50x50 in my actual case.

If a has the shape (m, n), the reshape should have the form a.reshape(m_bins, m // m_bins, n_bins, n // n_bins).
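For example, applying that reshape trick to the 4x6 array above (a minimal sketch; with 2x2 bins this gives m_bins = 2 and n_bins = 3):

import numpy as np

a = np.array([[  0,   0,   0,   0,   0,   0],
              [  0, 144,   0,   0,   0,   0],
              [  0,   0,   0, 144,   3,   0],
              [109, 112, 116, 121,  40,  91]])

m, n = a.shape           # 4, 6
m_bins, n_bins = 2, 3    # 2x2 bins
binned = a.reshape(m_bins, m // m_bins, n_bins, n // n_bins).sum(axis=(1, 3))
# array([[144,   0,   0],
#        [221, 381, 134]])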

But because I have to work with a big array (more than 1k x 1k), this takes tens of milliseconds on my PC. Is there any way to get this done faster, such as using C in Cython?

Harvey

2 Answers


It might be surprising, but summing up some values of a matrix isn't an easy task. This and this answer of mine give some insight, so I'm not going to repeat it in much detail, but to get the best performance one must make the best possible use of the cache and of the SIMD/pipelined floating-point units of modern CPUs.

Numpy gets all of the above right and is quite hard to beat. It is not impossible, but one must really be intimate with low-level optimization to succeed, and one should not expect much of an improvement. A naive implementation won't beat Numpy at all.

Here are my attempts using Cython and Numba, where all of the speed-up comes from parallelization.

Let's start with the baseline-algorithm:

def bin2d(a,K):
    m_bins = a.shape[0]//K
    n_bins = a.shape[1]//K
    return a.reshape(m_bins, K, n_bins, K).sum(3).sum(1)

and measure its performance:

import numpy as np
N,K=2000,50
a=np.arange(N*N, dtype=np.float64).reshape(N,N)
%timeit bin2d(a,K)
# 2.76 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Cython:

Here is my implementation with Cython, which uses OpenMP for parallelization. To keep things simple I perform the summation in-place, so the passed array gets modified (which is not the case for the numpy version):

%%cython -a  -c=/openmp --link-args=/openmp -c=/arch:AVX512
import numpy as np
import cython
from cython.parallel import prange
cdef extern from *:
    """
    //assumes direct memory access and row-major order (i.e. rows are contiguous)
    double calc_bin(double *ptr, int N, int y_offset){
       double *row = ptr;
       for(int y=1;y<N;y++){
           row+=y_offset; //next row
           //there are no dependencies, so the summation can be done in parallel
           for(int x=0;x<N;x++){
               ptr[x]+=row[x];
           }
       }
       double res=0.0;
       for(int x=0;x<N;x++){
           //could be made slightly faster (summation is not in parallel), but is it needed?
           res+=ptr[x];
       }
       return res;
    }
    """
    double calc_bin(double *ptr, int N, int y_offset) nogil


@cython.boundscheck(False)
@cython.wraparound(False)
def cy_bin2d_parallel(double[:, ::1] a, int K):
    cdef int y_offset = a.shape[1]  # number of columns = row stride of the C-contiguous array
    cdef int m_bins = a.shape[0]//K
    cdef int n_bins = a.shape[1]//K
    cdef double[:,:] res = np.empty((m_bins, n_bins), dtype=np.float64)
    cdef int i,j,k
    for k in prange(m_bins*n_bins, nogil=True):
        i = k // n_bins  # bin row index
        j = k % n_bins   # bin column index
        res[i,j] = calc_bin(&a[i*K, j*K], K, y_offset)
    return res.base

And now (after verifying that the results are the same):

a=np.arange(N*N, dtype=np.float64).reshape(N,N)
%timeit cy_bin2d_parallel(a,K)
# 1.51 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# without parallelization: 2.93 ms

which is about 30% faster. Using -c=/arch:AVX512 (otherwise it is compiled only with /Ox by MSVC), as proposed by @max9111, made the single-threaded version about 20% faster but brought only about 5% for the parallel version.
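For reference, on a gcc/clang toolchain the corresponding cell magic would look roughly like the following (this is my assumption, not part of the original setup; -fopenmp and -march=native are rough counterparts of /openmp and /arch:AVX512):

%%cython -a -c=-fopenmp --link-args=-fopenmp -c=-march=native
# same Cython/C code as above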

Numba:

Here is the same algorithm compiled with Numba (which can often beat Cython thanks to the better performance of the clang compiler). The result is slightly slower than the Cython version, but it beats numpy by about 20%:

import numba as nb
@nb.njit(parallel=True)
def nb_bin2d_parallel(a, K):
    m_bins = a.shape[0]//K
    n_bins = a.shape[1]//K
    res = np.zeros((m_bins, n_bins), dtype=np.float64)

    for k in nb.prange(m_bins*n_bins):
        i = k // n_bins  # bin row index
        j = k % n_bins   # bin column index
        for y in range(i*K+1, (i+1)*K):
            for x in range(j*K, (j+1)*K):
                a[i*K, x] += a[y,x]
        s=0.0
        for x in range(j*K, (j+1)*K):
            s+=a[i*K, x]
        res[i,j] = s
    return res

and now:

a=np.arange(N*N, dtype=np.float64).reshape(N,N)
%timeit nb_bin2d_parallel(a,K)
# 1.98 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# without parallelization: 5.8 ms

In a nutshell: I guess it is possible to beat the above, but there is no free lunch anymore, as numpy does its job pretty well. The most potential probably lies in the parallelization, but it is limited by the memory-bandwidth-bound nature of the problem (and one should probably use a smarter strategy than mine: for 50x50 bins one can possibly still see an overhead from the creation/management of the threads).


Here is another, faster (at least for sizes around 2000) attempt with Cython, which performs the summation not over small 50-element parts but over whole rows, thus reducing the overhead (but it may have more cache misses once a row is longer than 1-2k elements):

%%cython -a --verbose -c=/openmp --link-args=/openmp -c=/arch:AVX512
import numpy as np
import cython
from cython.parallel import prange
cdef extern from *:
    """
    void calc_bin_row(double *ptr, int N, int y_offset, double* out){
       double *row = ptr;
       for(int y=1;y<N;y++){
           row+=y_offset; //next row
           for(int x=0;x<y_offset;x++){
               ptr[x]+=row[x];
           }
       }
       double res=0.0;
       int i=0;
       int k=0;
       for(int x=0;x<y_offset;x++){//could be made faster, but is it needed?
           res+=ptr[x];
           k++;
           if(k==N){
              k=0;
              out[i]=res;
              i++;
              res=0.0;
           }
       }
    }
    """
    void calc_bin_row(double *ptr, int N, int y_offset, double* out) nogil


@cython.boundscheck(False)
@cython.wraparound(False)
def cy_bin2d_parallel_rowise(double[:, ::1] a, int K):
    cdef int y_offset = a.shape[1]
    cdef int m_bins = a.shape[0]//K
    cdef int n_bins = a.shape[1]//K
    cdef double[:,:] res = np.empty((m_bins, n_bins), dtype=np.float64)
    cdef int n_rows = a.shape[0]
    cdef int k
    for k in prange(0, n_rows, K, nogil=True):
        calc_bin_row(&a[k, 0], K, y_offset, &res[k//K, 0])
    return res.base

which is already faster single-threaded (2 ms) and about 60% faster with parallelization (1.27 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)).
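Because the Cython and Numba versions above sum in-place and modify their input, checking that all results match (as mentioned earlier) has to be done on fresh copies; a minimal sketch of such a check:

a = np.arange(N*N, dtype=np.float64).reshape(N, N)
expected = bin2d(a, K)  # the numpy baseline does not modify a
assert np.allclose(cy_bin2d_parallel(a.copy(), K), expected)
assert np.allclose(cy_bin2d_parallel_rowise(a.copy(), K), expected)
assert np.allclose(nb_bin2d_parallel(a.copy(), K), expected)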

ead
  • The Numba version should be a bit faster if you sum up to a scalar in the second loop like you did in your C code (and write the result to the array res at the end, allocating res with np.empty). Also fastmath (Numba and Cython) and march=native (in Numba per default, but not in Cython) might help if the function isn't already completely memory-bottlenecked. – max9111 Apr 21 '20 at 13:10
  • @max9111, thanks. It was probably naive of me to expect that writing to global memory would be optimized as well as writing to a register/local variable. This change brought about 5%. However, using fastmath had no effect (as could be expected, because the only potential is in the last summation loop). Using /arch:AVX512 (it is MSVC) brought a further speed-up for Cython. – ead Apr 21 '20 at 14:37
  • Found a slightly faster way (only tested in Numba), but maybe you can gain a bit of performance in Cython/C too. I will add a "comment" answer. – max9111 Apr 21 '20 at 20:44

This is basically a comment on @ead's answer.

Normally the best you can do for an aligned summation is to use scalars as much as possible (they map to registers). Unlike the versions above, this function doesn't modify the input array (modifying it in place may not be wanted).

Code and Timings

@nb.njit(parallel=True,fastmath=True,cache=True)
def nb_bin2d_parallel_2(a, K):
    #There is no bounds-checking, make sure that the dimensions are OK
    assert a.shape[0]%K==0
    assert a.shape[1]%K==0

    m_bins = a.shape[0]//K
    n_bins = a.shape[1]//K
    #Works for all datatypes, but overflow especially in small integer types
    #may occur
    res = np.zeros((m_bins, n_bins), dtype=a.dtype)

    for i in nb.prange(res.shape[0]):
        for ii in range(i*K,(i+1)*K):
            for j in range(res.shape[1]):
                TMP=res[i,j]
                for jj in range(j*K,(j+1)*K):
                    TMP+=a[ii,jj]
                res[i,j]=TMP
    return res

N,K=2000,50
a=np.arange(N*N, dtype=np.float64).reshape(N,N)

#warmup (Numba compilation is on the first call)
res_1=nb_bin2d_parallel(a, K)
res_2=cy_bin2d_parallel(a,K)
res_3=bin2d(a,K)
res_4=nb_bin2d_parallel_2(a, K)

%timeit bin2d(a,K)
#2.51 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nb_bin2d_parallel(a, K)
#1.33 ms ± 33.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit nb_bin2d_parallel_2(a, K)
#1.05 ms ± 8.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit cy_bin2d_parallel(a,K) #arch:AVX2
#996 µs ± 7.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

N,K=4000,50
a=np.arange(N*N, dtype=np.float64).reshape(N,N)

%timeit bin2d(a,K)
#10.8 ms ± 56.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nb_bin2d_parallel(a, K)
#5.13 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nb_bin2d_parallel_2(a, K)
#3.99 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit cy_bin2d_parallel(a,K) #arch:AVX2
#4.31 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
max9111
  • True, it beats my (old) implementation quite comfortably. I guess sometimes I try to help the optimizer too much - they are just better than me. I've also posted another approach, which is slightly faster than your version - probably there are ways to improve it even further, but then one must really understand what the bottleneck is. – ead Apr 21 '20 at 21:09