Parallel reduction of an array on CPU

Question

Is there a way to do parallel reduction of an array on CPU in C/C++?. I recently learnt that it's not possible using openmp. Any other alternatives?

You mean "it's not possible using OpenMP, unless you use a recent, standards conforming Fortran compiler", don't you? — talonmies, Feb 22 '12 at 17:42
You can use prefix sum to write a parallel reduction algorithm. http://en.wikipedia.org/wiki/Prefix_sum — perreal, Feb 22 '12 at 17:49
@perreal I was thinking more of an API/built-in algorithm to do it. — captain, Feb 22 '12 at 17:51

score 7 · Accepted Answer · edited May 23 '17 at 11:58

Added: Note that you can implement "custom" reduction with OpenMP, in the way described here.

For C++: with parallel_reduce in Intel's TBB (SO tag: tbb), you can make reduction on complex types such as arrays and structs. Though the amount of required code can be significantly bigger compared to OpenMP's reduction clause.

As an example, let's parallelize a naive implementation of matrix-to-vector multiplication: y=Cx. Serial code consists of two loops:

double x[N], y[M], C[N][M];
// assume x and C are initialized, and y consists of zeros
for(int i=0; i<N; ++i)
  for(int j=0; j<M; ++j) 
    y[j] += C[i][j]*x[i];

Usually, to parallelize it the loops are exchanged to make the outer loop iterations independent and process them in parallel:

#pragma omp parallel for
for(int j=0; j<M; ++j) 
  for(int i=0; i<N; ++i)
    y[j] += C[i][j]*x[i];

However it's not always good idea. If M is small and N is large, swapping the loop won't give enough parallelism (for example, think of calculating a weighted centroid of N points in M-dimensional space, with C being the array of points and x being the array of weights). So a reduction over an array (i.e. a point) would be helpful. Here is how it can be done with TBB (sorry, the code was not tested, errors are possible):

struct reduce_body {
  double y_[M]; // accumulating vector
  double (& C_)[N][M]; // reference to a matrix
  double (& x_)[N];    // reference to a vector

  reduce_body( double (&C)[N][M], double (&x)[N] )  : C_(C), x_(x) {
    for (int j=0; j<M; ++j) y_[j] = 0.0; // prepare for accumulation
  }
  // splitting constructor required by TBB
  reduce_body( reduce_body& rb, tbb::split ) : C_(rb.C_), x_(rb.x_) { 
    for (int j=0; j<M; ++j) y_[j] = 0.0;
  }
  // the main computation method
  void operator()(const tbb::blocked_range<int>& r) {
    // closely resembles the original serial loop
    for (int i=r.begin(); i<r.end(); ++i) // iterates over a subrange in [0,N)
      for (int j=0; j<M; ++j)
        y_[j] += C_[i][j]*x_[i];
  }
  // the method to reduce computations accumulated in two bodies
  void join( reduce_body& rb ) {
    for (int j=0; j<M; ++j) y_[j] += rb.y_[j];
  }
};
double x[N], y[M], C[N][M];
...
reduce_body body(C, x);
tbb::parallel_reduce(tbb::blocked_range<int>(0,N), body);
for (int j=0; j<M; ++j)
  y[j] = body.y_[j]; // copy to the destination array

Disclaimer: I am affiliated with TBB.

Hello @Alexey, there's not enough documentation on parallel_reduce for arrays, most of it is for simple reduction. Can you link to a source?. — captain, Mar 02 '12 at 18:06
Thanks a lot for the code @Alexey. But, I didn't understand why a[j] is used? — captain, Mar 04 '12 at 09:50

Parallel reduction of an array on CPU

1 Answers1