
I have a nested loop (L and A are fully initialized inputs):

    #pragma omp parallel for schedule(guided) shared(L,A) \
    reduction(+:dummy)
    for (i = k + 1; i < row; i++) {
        for (n = 0; n < k; n++) {
            #pragma omp atomic
            dummy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummy) / L[k][k];
        }
        dummy = 0;
    }

And its sequential version:

    for (i = k + 1; i < row; i++) {
        for (n = 0; n < k; n++) {
            dummy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummy) / L[k][k];
        }
        dummy = 0;
    }

They give different results, and the parallel version is much slower than the sequential one.

What could be causing this?

Edit:

To get rid of the problems caused by the atomic directive, I modified the code as follows:

    #pragma omp parallel for schedule(guided) shared(L,A) \
    private(i)
    for (i = k + 1; i < row; i++) {
        double dummyy = 0;
        for (n = 0; n < k; n++) {
            dummyy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummyy) / L[k][k];
        }
    }

But this didn't solve the problem either; the results are still different.

Emre Turkoz

3 Answers


I am not very familiar with OpenMP, but it seems to me that your calculations are not order-independent. Namely, the inner loop writes its result into L[i][k], where i and k are invariant for the inner loop. This means the same location is overwritten k times during the inner loop, resulting in a race condition.

Moreover, dummy seems to be shared between the different threads, so there might be a race condition there too, unless your pragma parameters somehow prevent it.

Altogether, to me it looks like the calculations in the inner loop must be performed in the same sequential order, if you want the same result as given by the sequential execution. Thus only the outer loop can be parallelized.
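For illustration, a minimal sketch of parallelizing only the outer loop, assuming (from the variable names) that this is the column update of a Cholesky-style factorization. Note that the write to L[i][k] is hoisted out of the inner loop, since only the value computed from the full sum matters:

    /* Sketch: parallelize only the outer loop. Each thread has its own
       accumulator and inner-loop counter, so the i iterations are
       independent and L[i][k] is written exactly once per i. */
    #pragma omp parallel for schedule(guided) shared(L, A)
    for (int i = k + 1; i < row; i++) {
        double sum = 0.0;                    /* thread-private accumulator */
        for (int n = 0; n < k; n++)          /* n declared here, so private */
            sum += L[i][n] * L[k][n];
        L[i][k] = (A[i][k] - sum) / L[k][k];
    }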

Péter Török
  • I also thought about that, but with these pragma clauses only the outer loop should be parallelized. Obviously, though, something else is going on within the threads, so the results are not the same. – Emre Turkoz Apr 07 '12 at 08:45

In your parallel version you've inserted an unnecessary (and possibly harmful) atomic directive. Once you've declared dummy to be a reduction variable, OpenMP takes care of stopping the threads from interfering with each other in the reduction. I think the main impact of the unnecessary directive is to slow your code down, a lot.

I see you have another answer addressing the wrongness of your results. But I notice that you seem to set dummy to 0 at the end of each outer-loop iteration, which seems strange if you are trying to use it as some kind of accumulator, which is what the reduction clause suggests. Perhaps you want to reduce into dummy across the inner loop instead, as sketched below?
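A minimal sketch of that variant, using the question's names, for one fixed i (whether parallelizing the short inner loop pays off depends on how large k is):

    double dummy = 0.0;
    #pragma omp parallel for reduction(+:dummy)
    for (int n = 0; n < k; n++)
        dummy += L[i][n] * L[k][n];          /* each thread sums privately */
    L[i][k] = (A[i][k] - dummy) / L[k][k];   /* combined result, no atomic */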

If you are having problems with reduction, read this.

High Performance Mark
  • I actually want the inner loop to run sequentially in each thread. I assume right now that the inner loop is also being distributed; otherwise I wouldn't suffer these problems. I modified the code slightly and edited the question so that you can see the modification. – Emre Turkoz Apr 07 '12 at 08:50

The difference in results comes from the inner-loop variable n, which is shared between threads because it is declared outside of the omp pragma.

Clarified: the loop variable n should be thread-specific, so declare it inside the parallel region, for example for (int n = 0; ...); see the sketch below.
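A sketch of the question's edited loop with n made thread-private by declaring it in the inner loop header (this assumes C99; alternatively, keep the outer declaration and write private(i, n) in the clause):

    #pragma omp parallel for schedule(guided) shared(L, A) private(i)
    for (i = k + 1; i < row; i++) {
        double dummyy = 0;
        for (int n = 0; n < k; n++) {        /* n is now thread-private */
            dummyy += L[i][n] * L[k][n];
            L[i][k] = (A[i][k] - dummyy) / L[k][k];  /* as in the question */
        }
    }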

Lubo Antonov