I have two remarks concerning Zboson's answer:
1. Method 1 is certainly correct but the reduction loop is actually run serially, because of the #pragma omp critical which is of course necessary as the partial matrices are local to each thread and the corresponding reduction has to be done by the thread owing the matrix.
2. Method 2: The initialization loop can be moved outside the single section and therefore become parallelizable.
The following program implements array reduction using openMP v4.0 user defined reduction facility:
/* Compile with:
gcc -Wall -fopenmp -o ar ar.c
Run with:
OMP_DISPLAY_ENV=TRUE OMP_NUM_THREADS=10 OMP_NESTED=TRUE ./ar
*/
#include <stdio.h>
#include <omp.h>
struct m10x1 {int v[10];};
int A [] = {84, 30, 95, 94, 36, 73, 52, 23, 2, 13};
struct m10x1 S = {{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};
int n,m=0;
void print_m10x1(struct m10x1 x){
int i;
for(i=0;i<10;i++) printf("%d ",x.v[i]);
printf("\n");
}
struct m10x1 add_m10x1(struct m10x1 x,struct m10x1 y){
struct m10x1 r ={{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};
int i;
for (i=0;i<10;i++) r.v[i]=x.v[i]+y.v[i];
return r;
}
#pragma omp declare reduction(m10x1Add: struct m10x1: \
omp_out=add_m10x1(omp_out, omp_in)) initializer( \
omp_priv={{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} )
int main ()
{
#pragma omp parallel for reduction(m10x1Add: S)
for ( n=0 ; n<10 ; ++n )
{
for (m=0; m<=n; ++m){
S.v[n] += A[m];
}
}
print_m10x1(S);
}
This follows verbatim the complex number reduction example on page 97 of OpenMP 4.0 features.
Although the parallel version works correctly, there probably are performance issues, which I have not investigated:
- add_m10x1 inputs and output are passed by value.
- The loop in add_m10x1 is run serially.
Said "performance issues" are of my own making and it is completely straightforward not to introduce them:
- Parameters to add_m10x1 should be passed by reference (via pointers in C, references in C++)
- The computation in add_m10x1 should be done in place.
- add_m10x1 should be declared void and the return statement deleted. The result is returned via the first parameter.
- The declare reduction pragma should be accordingly modified, the combiner should be just a function call and not an assignment (v4.0 specs p181 lines 9,10).
- The for loop in add_m10x1 can be parallelized via an omp parallel for pragma
- Parallel nesting should be enabled (e.g. via OMP_NESTED=TRUE)
The modified part of the code then is:
void add_m10x1(struct m10x1 * x,struct m10x1 * y){
int i;
#pragma omp parallel for
for (i=0;i<10;i++) x->v[i] += y->v[i];
}
#pragma omp declare reduction(m10x1Add: struct m10x1: \
add_m10x1(&omp_out, &omp_in)) initializer( \
omp_priv={{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}} )