I would like to implement a very efficient parallel reduction operation (i.e. a summation): each column of a 2-dimensional array (a memory buffer in row-major layout) should be summed into an entry of a 1-dimensional array.

To be clear about the expected input and output:

double* array = malloc(sizeof(double) * shape0 * shape1); /* a (shape0 x shape1) 2-D array */
double* out = malloc(sizeof(double) * shape1);            /* where out[j] = sum_i(array[i][j]) */

Parallelising the sum over rows is pretty straightforward and efficient, because the values are contiguous in memory and there is no risk of race conditions. I found that this works really well:

void sum_rows(double* array, int shape0, int shape1, double* out) {
    int i, j;
    /* each thread owns whole rows, so the writes to out[i] never
       conflict; out is assumed to be zero-initialised */
    #pragma omp parallel for private(j) schedule(guided)
    for (i=0; i < shape0; i++){
        for (j=0; j < shape1; j++){
            out[i] += array[shape1 * i + j];
        }
    }
}

I am finding it more difficult to parallelise over the other axis. This should be a straightforward parallel recipe, but I was not able to find a definitive answer on the most efficient way to program it.

This is the naive serial code I would like to write an efficient parallel version of:

void sum_columns(double* array, int shape0, int shape1, double* out) {
    int i, j;
    for (i=0; i < shape0; i++){
        for (j=0; j < shape1; j++){
            out[j] += array[shape1 * i + j];
        }
    }
}
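
For reference, a textbook alternative is an OpenMP array-section reduction, which gives every thread a private copy of out and combines the copies at the end. A minimal sketch, assuming a compiler that supports OpenMP 4.5 (e.g. GCC 6 or later):

void sum_columns_reduction(double* array, int shape0, int shape1, double* out) {
    int i, j;
    /* each thread accumulates into a private copy of out[0..shape1-1];
       OpenMP combines the copies when the loop ends */
    #pragma omp parallel for private(j) reduction(+:out[:shape1])
    for (i = 0; i < shape0; i++) {
        for (j = 0; j < shape1; j++) {
            out[j] += array[shape1 * i + j];
        }
    }
}

Note that the private copies cost O(shape1) memory per thread, which may be one reason why reduction-based recipes do not always beat the naive serial loop.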

Note: I have already read the following Q&As, but they didn't lead me to any speedup over the naive sequential code:

Parallelizing matrix times a vector by columns and by rows with OpenMP

OpenMP average of an array

Reduction with OpenMP

  • Apparently, you may be ignoring "definitive" documentation on how simd reduction is done for your inner loop. It also depends on your compiler and options. – tim18 Dec 09 '17 at 13:35
  • In case you need the hint, you would need simd intrinsics to optimize with Microsoft compiler, but it is easy with modern compilers. – tim18 Dec 09 '17 at 13:38
  • inner_product() should be considered if you are using modern C++. – tim18 Dec 09 '17 at 13:44
  • I would also quibble with your rejection of BLAS as not having definitive documentation. My previous comments assumed you didn't want to bypass openmp. – tim18 Dec 09 '17 at 14:10
  • I might have been expressing myself extremely poorly... I feel like I have been misunderstood here. I am just trying to find the solution to a problem that I recognize is probably trivial to people that work daily with openmp and was expressing surprise not to find the solution after some googling. No intention to be judgemental... Just asking for a solution. – Gioelelm Dec 09 '17 at 18:17
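
For reference, the SIMD reduction tim18 hints at would apply to the accumulation in the inner loop of sum_rows; a minimal sketch, assuming OpenMP 4.0 or later (whether it pays off depends on the compiler and options, as he notes):

void sum_rows_simd(double* array, int shape0, int shape1, double* out) {
    int i;
    #pragma omp parallel for schedule(guided)
    for (i = 0; i < shape0; i++) {
        double acc = 0.0; /* per-row accumulator, reduced across SIMD lanes */
        #pragma omp simd reduction(+:acc)
        for (int j = 0; j < shape1; j++) {
            acc += array[shape1 * i + j];
        }
        out[i] = acc;
    }
}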

1 Answer

Just reporting the fastest implementation I was able to achieve after some attempts. Here I assign the columns to different threads, so that each thread works on data that is as local as possible and false sharing is avoided.

void sum_columns(double* array, int N_rows, int N_cols, double* out, int n_threads) {
    omp_set_dynamic(0);
    omp_set_num_threads(n_threads);
    #pragma omp parallel
    {
        /* private vars */
        int i, j, id, N_threads, col_chunk_size, start_col, end_col;
        id = omp_get_thread_num();
        N_threads = omp_get_num_threads();
        /* distribute columns over threads; the last thread picks up the remainder */
        col_chunk_size = N_cols / N_threads;
        start_col = id * col_chunk_size;
        end_col = (id+1) * col_chunk_size;
        if (id == N_threads - 1) end_col = N_cols;

        /* main loop: each thread updates only its own slice of out,
           so there are no races (out is assumed zero-initialised) */
        for (i=0; i < N_rows; i++){
            for (j=start_col; j < end_col; j++){
                out[j] += array[N_cols * i + j];
            }
        }
    }
}
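
For completeness, a minimal calling sketch; note that out must be zero-initialised (here via calloc), since the function only accumulates into it. The sizes are arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    int N_rows = 10000, N_cols = 1000;
    double* array = malloc(sizeof(double) * N_rows * N_cols);
    double* out = calloc(N_cols, sizeof(double)); /* zeroed before accumulation */
    for (int i = 0; i < N_rows * N_cols; i++) array[i] = 1.0;
    sum_columns(array, N_rows, N_cols, out, omp_get_max_threads());
    printf("out[0] = %f (expected %d)\n", out[0], N_rows);
    free(array);
    free(out);
    return 0;
}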