
I have a nested loop (L and A are fully initialized inputs):

    #pragma omp parallel for schedule(guided) shared(L,A) \
    reduction(+:dummy)
    for (i = k + 1; i < row; i++) {
        for (n = 0; n < k; n++) {
            #pragma omp atomic
            dummy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummy) / L[k][k];
        }
        dummy = 0;
    }

And its sequential version:

    for (i = k + 1; i < row; i++) {
        for (n = 0; n < k; n++) {
            dummy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummy) / L[k][k];
        }
        dummy = 0;
    }

They give different results, and the parallel version is much slower than the sequential one.

What could be causing this?

Edit:

To get rid of the problems caused by the atomic directive, I modified the code as follows:

    #pragma omp parallel for schedule(guided) shared(L,A) \
    private(i)
    for (i = k + 1; i < row; i++) {
        double dummyy = 0;
        for (n = 0; n < k; n++) {
            dummyy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummyy) / L[k][k];
        }
    }

But this didn't solve the problem either; the results are still different.

Emre Turkoz

3 Answers


I am not very familiar with OpenMP, but it seems to me that your calculations are not order-independent. Namely, the inner loop writes its result into L[i][k], where i and k are invariant for the inner loop. This means the same location is overwritten k times during the inner loop, resulting in a race condition.

Moreover, dummy seems to be shared between the different threads, so there might be a race condition there too, unless your pragma parameters somehow prevent it.

Altogether, to me it looks like the calculations in the inner loop must be performed in the same sequential order, if you want the same result as given by the sequential execution. Thus only the outer loop can be parallelized.
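For illustration, a minimal sketch of parallelizing only the outer loop, assuming (from the variable names) that this is the column update of a Cholesky-style factorization. Note that the write to L[i][k] is hoisted out of the inner loop, since only the value computed from the full sum matters:

    /* Sketch: parallelize only the outer loop. Each thread has its own
       accumulator and inner-loop counter, so the i iterations are
       independent and L[i][k] is written exactly once per i. */
    #pragma omp parallel for schedule(guided) shared(L, A)
    for (int i = k + 1; i < row; i++) {
        double sum = 0.0;                    /* thread-private accumulator */
        for (int n = 0; n < k; n++)          /* n declared here, so private */
            sum += L[i][n] * L[k][n];
        L[i][k] = (A[i][k] - sum) / L[k][k];
    }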

Péter Török
  • I also thought about that, but with these pragma clauses only the outer loop should be parallelized. Obviously, though, something else is going on within the threads, so the results are not the same. – Emre Turkoz Apr 07 '12 at 08:45

In your parallel version you've inserted an unnecessary (and possibly harmful) atomic directive. Once you've declared dummy to be a reduction variable, OpenMP takes care of stopping the threads from interfering with each other in the reduction. I think the main impact of the unnecessary directive is to slow your code down, a lot.

I see you have another answer addressing the wrongness of your results. But I notice that you seem to set dummy to 0 at the end of each outer-loop iteration, which seems strange if you are trying to use it as some kind of accumulator, which is what the reduction clause suggests. Perhaps you want to reduce into dummy across the inner loop instead, as sketched below?
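A minimal sketch of that variant, using the question's names, for one fixed i (whether parallelizing the short inner loop pays off depends on how large k is):

    double dummy = 0.0;
    #pragma omp parallel for reduction(+:dummy)
    for (int n = 0; n < k; n++)
        dummy += L[i][n] * L[k][n];          /* each thread sums privately */
    L[i][k] = (A[i][k] - dummy) / L[k][k];   /* combined result, no atomic */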

If you are having problems with reduction, read this.

High Performance Mark
  • I actually want the inner loop to run sequentially in each thread. I assume right now that the inner loop is also being distributed; otherwise I wouldn't suffer these problems. I modified the code slightly and edited the question so that you can see the modification. – Emre Turkoz Apr 07 '12 at 08:50

The difference in results comes from the inner-loop variable n, which is shared between threads because it is declared outside of the omp pragma.

Clarified: the loop variable n should be thread-specific, so declare it inside the parallel region, for example for (int n = 0; ...); see the sketch below.
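A sketch of the question's edited loop with n made thread-private by declaring it in the inner loop header (this assumes C99; alternatively, keep the outer declaration and write private(i, n) in the clause):

    #pragma omp parallel for schedule(guided) shared(L, A) private(i)
    for (i = k + 1; i < row; i++) {
        double dummyy = 0;
        for (int n = 0; n < k; n++) {        /* n is now thread-private */
            dummyy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummyy) / L[k][k];  /* as in the question */
        }
    }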

Lubo Antonov