
I am working with a signal matrix, and my goal is to calculate the sum of all elements of each row. The matrix is represented by the following struct:

typedef struct matrix {
  float *data;       /* elements, stored in column-major order */
  int rows;
  int cols;
  int leading_dim;   /* stride between consecutive columns (the leading dimension) */
} matrix;

I should mention that the matrix is stored in column-major order (http://en.wikipedia.org/wiki/Row-major_order#Column-major_order), which explains the formula `column * tan_hd.rows + row` for computing the correct index.
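For illustration, this is how I think of element access (the helper is just a sketch, not part of my actual code; I assume `leading_dim` would replace `rows` here if the columns were padded, but in my case the plain row count works):

static inline float mat_get(const matrix *m, int row, int column) {
    /* column-major: consecutive elements of a column are adjacent in memory */
    return m->data[column * m->rows + row];
}

This is the loop in question: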

for(int row = 0; row < tan_hd.rows; row++) {
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for(int column = 0; column < tan_hd.cols; column++) {
        sum += tan_hd.data[column * tan_hd.rows + row];
    }
    printf("row %d: %f", row, sum);
}

Without the OpenMP pragma, the result is correct and looks like this:

row 0: 8172539.500000 row 1: 8194582.000000 

As soon as I add the #pragma omp... as described above, a different (wrong) result is returned:

row 0: 8085544.000000 row 1: 8107186.000000

In my understanding, reduction(+:sum) creates a private copy of sum for each thread, and after completing the loop these partial results are summed up and written back to the global variable sum. What am I doing wrong?
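To make my mental model explicit, here is a hand-written sketch of what I assume the reduction expands to (not code I actually run):

float sum = 0.0f;
#pragma omp parallel
{
    float local_sum = 0.0f;                    /* private copy per thread */
    #pragma omp for
    for(int column = 0; column < tan_hd.cols; column++)
        local_sum += tan_hd.data[column * tan_hd.rows + row];
    #pragma omp atomic
    sum += local_sum;                          /* partial results combined, once per thread */
}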

I appreciate your suggestions!

Max Plauth
  • I think the reason is that `column` is local: for each parallel chunk, `column` is initialized to zero, but it should not be. Move it out of the `for` loop. – MYMNeo Aug 02 '13 at 10:31
  • Floating-point addition is not associative - small rounding errors accumulate in a different way if the order of summing of the elements is changed. Also 64-bit or 80-bit internal precision could be used in the serial case which is lost during the reduction phase as each value is converted to single precision. – Hristo Iliev Aug 02 '13 at 10:36
  • @Hristo Iliev: Thanks, I didn't know floating-point addition is not associative. Just for fun I'll try using doubles instead and observe whether the error gets smaller or remains unchanged :) – Max Plauth Aug 02 '13 at 10:53
  • A first attempt at using double values just resulted in a larger error. On the other hand, I could not find any race conditions either: neither `tan_hd.data` nor `row` is changed/written to during the `omp` section, and `omp` should take care of `column`... – Max Plauth Aug 02 '13 at 11:32
  • Oh, another thought: what order of magnitude could the floating-point arithmetic errors have? The array consists of 11.5 million columns of audio data ranging from -1.0 to 1.0, so we are dealing with many small values whose rounding errors might well add up to such a big difference... What do you think? – Max Plauth Aug 02 '13 at 11:37
  • @HighPerformanceMark, I don't see any data races either, and that's why my next guess was that it is due to rounding errors. Having an array with many positive and negative values could easily result in catastrophic cancellation. – Hristo Iliev Aug 02 '13 at 11:37
  • Switching from floats to doubles would not improve the precision if you subtract similar values. It would only increase the amount of precision being lost. – Hristo Iliev Aug 02 '13 at 11:40
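To make the non-associativity point from these comments concrete, here is a minimal stand-alone demonstration (the constants are chosen to make the effect obvious; they are not taken from the audio data):

#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    /* The grouping of the additions changes the result, which is exactly
       what a parallel reduction does: it changes the summation order. */
    printf("%f vs %f\n", (a + b) + c, a + (b + c));   /* prints 1.000000 vs 0.000000 */
    return 0;
}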

1 Answer


Use the Kahan summation algorithm (http://en.wikipedia.org/wiki/Kahan_summation_algorithm):

  • It has the same algorithmic complexity as a naive summation
  • It will greatly increase the accuracy of the summation without requiring you to switch to double.

Rewriting your code to implement it:

for(int row = 0; row < tan_hd.rows; row++) {
    float sum = 0.0f, c = 0.0f;        /* c accumulates the lost low-order bits */
    #pragma omp parallel for reduction(+:sum,c)
    for(int column = 0; column < tan_hd.cols; column++) {
        float y = tan_hd.data[column * tan_hd.rows + row] - c;
        float t = sum + y;
        c = (t - sum) - y;             /* (t - sum) recovers what was actually added; subtracting y isolates the rounding error */
        sum = t;
    }
    sum = sum - c;                     /* fold the combined compensation terms into the result */
    printf("row %d: %f", row, sum);
}

You can additionally switch all float variables to double for higher precision, but since your array holds float data, this should only change the last few significant digits of the result.
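One caveat (an assumption on my part, since the question does not show the build flags): Kahan summation relies on each statement being rounded exactly as written, so optimizations that reassociate floating-point expressions, such as GCC's -ffast-math, can optimize the compensation term away entirely. Build with standard floating-point semantics, for example (the file name is made up):

gcc -fopenmp -O2 row_sums.c -o row_sums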

Kyle_the_hacker
  • The problem is not so much in the intermediate summing, but rather in the step where the local values of `sum` are combined into the final result, when some of them happen to have opposite signs and close absolute values. – Hristo Iliev Aug 02 '13 at 13:10
  • @HristoIliev: Since OP said: "The array consists of 11.5 million columns of audio data ranging from -1.0 to 1.0", the problem **is** in the intermediate summing. – Kyle_the_hacker Aug 02 '13 at 13:25