I would like to implement a very efficient parallel reduction (here, a summation): each column of a 2-dimensional array (a memory buffer in row-major layout) should be summed into an entry of a 1-dimensional array.
To be clearer about the expected input and output:
double* array = malloc(sizeof(double) * shape0 * shape1); /* (shape0 x shape1) 2-d array */
double* out = malloc(sizeof(double) * shape1);            /* where out[j] = sum_i(array_ij) */
Parallelising the sums of the rows is straightforward and efficient, because the values are contiguous in memory and there is no risk of race conditions. I found that this works really well:
/* out has shape0 entries here and must be zero-initialized */
void sum_rows(double* array, int shape0, int shape1, double* out) {
    int i, j;
    #pragma omp parallel for private(j) schedule(guided)
    for (i = 0; i < shape0; i++) {
        for (j = 0; j < shape1; j++) {
            out[i] += array[shape1 * i + j];
        }
    }
}
I am finding it more difficult to parallelise over the other axis. This should be a standard parallel recipe, but I was not able to find a definitive answer on the most efficient way to program it.
This is the naive serial code I would like to write an efficient parallel version of:
void sum_columns(double* array, int shape0, int shape1, double* out) {
    int i, j;
    for (i = 0; i < shape0; i++) {
        for (j = 0; j < shape1; j++) {
            out[j] += array[shape1 * i + j];
        }
    }
}
Note: I have already read the following Q&A, but it didn't lead me to any speedup over the naive sequential code:
Parallelizing matrix times a vector by columns and by rows with OpenMP