Is there a way to do parallel reduction of an array on CPU in C/C++?. I recently learnt that it's not possible using openmp. Any other alternatives?
-
1You mean "it's not possible using OpenMP, unless you use a recent, standards conforming Fortran compiler", don't you? – talonmies Feb 22 '12 at 17:42
-
@talonmies Using C or C++. I edited it just now! – captain Feb 22 '12 at 17:49
-
You can use prefix sum to write a parallel reduction algorithm. http://en.wikipedia.org/wiki/Prefix_sum – perreal Feb 22 '12 at 17:49
-
@perreal I was thinking more of an API/built-in algorithm to do it. – captain Feb 22 '12 at 17:51
1 Answers
Added: Note that you can implement "custom" reduction with OpenMP, in the way described here.
For C++: with parallel_reduce
in Intel's TBB (SO tag: tbb), you can make reduction on complex types such as arrays and structs. Though the amount of required code can be significantly bigger compared to OpenMP's reduction clause.
As an example, let's parallelize a naive implementation of matrix-to-vector multiplication: y=Cx
. Serial code consists of two loops:
double x[N], y[M], C[N][M];
// assume x and C are initialized, and y consists of zeros
for(int i=0; i<N; ++i)
for(int j=0; j<M; ++j)
y[j] += C[i][j]*x[i];
Usually, to parallelize it the loops are exchanged to make the outer loop iterations independent and process them in parallel:
#pragma omp parallel for
for(int j=0; j<M; ++j)
for(int i=0; i<N; ++i)
y[j] += C[i][j]*x[i];
However it's not always good idea. If M is small and N is large, swapping the loop won't give enough parallelism (for example, think of calculating a weighted centroid of N points in M-dimensional space, with C
being the array of points and x
being the array of weights). So a reduction over an array (i.e. a point) would be helpful. Here is how it can be done with TBB (sorry, the code was not tested, errors are possible):
struct reduce_body {
double y_[M]; // accumulating vector
double (& C_)[N][M]; // reference to a matrix
double (& x_)[N]; // reference to a vector
reduce_body( double (&C)[N][M], double (&x)[N] ) : C_(C), x_(x) {
for (int j=0; j<M; ++j) y_[j] = 0.0; // prepare for accumulation
}
// splitting constructor required by TBB
reduce_body( reduce_body& rb, tbb::split ) : C_(rb.C_), x_(rb.x_) {
for (int j=0; j<M; ++j) y_[j] = 0.0;
}
// the main computation method
void operator()(const tbb::blocked_range<int>& r) {
// closely resembles the original serial loop
for (int i=r.begin(); i<r.end(); ++i) // iterates over a subrange in [0,N)
for (int j=0; j<M; ++j)
y_[j] += C_[i][j]*x_[i];
}
// the method to reduce computations accumulated in two bodies
void join( reduce_body& rb ) {
for (int j=0; j<M; ++j) y_[j] += rb.y_[j];
}
};
double x[N], y[M], C[N][M];
...
reduce_body body(C, x);
tbb::parallel_reduce(tbb::blocked_range<int>(0,N), body);
for (int j=0; j<M; ++j)
y[j] = body.y_[j]; // copy to the destination array
Disclaimer: I am affiliated with TBB.

- 1
- 1

- 12,479
- 2
- 36
- 55
-
1Hello @Alexey, there's not enough documentation on parallel_reduce for arrays, most of it is for simple reduction. Can you link to a source?. – captain Mar 02 '12 at 18:06
-
Thanks a lot for the code @Alexey. But, I didn't understand why a[j] is used? – captain Mar 04 '12 at 09:50
-