4

Is there a way to do parallel reduction of an array on CPU in C/C++?. I recently learnt that it's not possible using openmp. Any other alternatives?

Community
  • 1
  • 1
captain
  • 815
  • 14
  • 26

1 Answers1

7

Added: Note that you can implement "custom" reduction with OpenMP, in the way described here.


For C++: with parallel_reduce in Intel's TBB (SO tag: ), you can make reduction on complex types such as arrays and structs. Though the amount of required code can be significantly bigger compared to OpenMP's reduction clause.

As an example, let's parallelize a naive implementation of matrix-to-vector multiplication: y=Cx. Serial code consists of two loops:

double x[N], y[M], C[N][M];
// assume x and C are initialized, and y consists of zeros
for(int i=0; i<N; ++i)
  for(int j=0; j<M; ++j) 
    y[j] += C[i][j]*x[i];

Usually, to parallelize it the loops are exchanged to make the outer loop iterations independent and process them in parallel:

#pragma omp parallel for
for(int j=0; j<M; ++j) 
  for(int i=0; i<N; ++i)
    y[j] += C[i][j]*x[i];

However it's not always good idea. If M is small and N is large, swapping the loop won't give enough parallelism (for example, think of calculating a weighted centroid of N points in M-dimensional space, with C being the array of points and x being the array of weights). So a reduction over an array (i.e. a point) would be helpful. Here is how it can be done with TBB (sorry, the code was not tested, errors are possible):

struct reduce_body {
  double y_[M]; // accumulating vector
  double (& C_)[N][M]; // reference to a matrix
  double (& x_)[N];    // reference to a vector

  reduce_body( double (&C)[N][M], double (&x)[N] )  : C_(C), x_(x) {
    for (int j=0; j<M; ++j) y_[j] = 0.0; // prepare for accumulation
  }
  // splitting constructor required by TBB
  reduce_body( reduce_body& rb, tbb::split ) : C_(rb.C_), x_(rb.x_) { 
    for (int j=0; j<M; ++j) y_[j] = 0.0;
  }
  // the main computation method
  void operator()(const tbb::blocked_range<int>& r) {
    // closely resembles the original serial loop
    for (int i=r.begin(); i<r.end(); ++i) // iterates over a subrange in [0,N)
      for (int j=0; j<M; ++j)
        y_[j] += C_[i][j]*x_[i];
  }
  // the method to reduce computations accumulated in two bodies
  void join( reduce_body& rb ) {
    for (int j=0; j<M; ++j) y_[j] += rb.y_[j];
  }
};
double x[N], y[M], C[N][M];
...
reduce_body body(C, x);
tbb::parallel_reduce(tbb::blocked_range<int>(0,N), body);
for (int j=0; j<M; ++j)
  y[j] = body.y_[j]; // copy to the destination array

Disclaimer: I am affiliated with TBB.

Community
  • 1
  • 1
Alexey Kukanov
  • 12,479
  • 2
  • 36
  • 55