
I have a fragment of code that calculates the force experienced by each unique pair of particles due to a pair potential. I am trying to parallelize the nested for loops, but the output of the parallel version differs from the correct (serial) output. The code is:

for (ic = 0; ic < (N_TOTAL-1); ic++)
{
    for (jc = ic+1; jc < N_TOTAL; jc++)
    {
        rc_x = X[ic] - X[jc];
        rc_z = Z[ic] - Z[jc];

        /* Minimum image convention */
        if (fabs(rc_x) > LxScaled/2.0)
        {
            rc_x = rc_x - SignR(LxScaled, rc_x);
        }
        if (fabs(rc_z) > LzScaled/2.0)
        {
            rc_z = rc_z - SignR(LzScaled, rc_z);
        }

        rij = sqrt(rc_x * rc_x + rc_z * rc_z);
        rr = rr + (rc_x * rc_x + rc_z * rc_z);

        /* Compute Force : Yukawa potential */
        second = exp(-rij * Xi) * ((1.0*(1.0))/rij);
        third = (Xi + 1.0/rij);
        a = second * third;

        a_x[ic] = a_x[ic] + (rc_x/rij)*a;
        a_x[jc] = a_x[jc] - (rc_x/rij)*a;
        a_z[ic] = a_z[ic] + (rc_z/rij)*a;
        a_z[jc] = a_z[jc] - (rc_z/rij)*a;

        uVal = second;
        uSum = uSum + second;
    }
}

Here X[i] and Z[i] are the positions and a_x[i] and a_z[i] are the x and z components of the force experienced by the i-th particle. I have OpenMP version 4.0, so I cannot use the reduction clause for the 1D arrays a_x[] and a_z[]. I therefore intend to use the #pragma omp atomic clause to avoid any race conditions. So I parallelize it as:

#pragma omp parallel for schedule(guided, 8) reduction(+:rr, uSum) default(shared) private(ic, jc, rc_x, rc_z, rij, uVal, second, third, a)
for (ic = 0; ic < (N_TOTAL-1); ic++)
{
    for (jc = ic+1; jc < N_TOTAL; jc++)
    {
        rc_x = X[ic] - X[jc];
        rc_z = Z[ic] - Z[jc];

        /* Minimum image convention */
        if (fabs(rc_x) > LxScaled/2.0)
        {
            #pragma omp atomic
            rc_x = rc_x - SignR(LxScaled, rc_x);
        }
        if (fabs(rc_z) > LzScaled/2.0)
        {
            #pragma omp atomic
            rc_z = rc_z - SignR(LzScaled, rc_z);
        }

        rij = sqrt(rc_x * rc_x + rc_z * rc_z);
        rr = rr + (rc_x * rc_x + rc_z * rc_z);

        /* Compute Force : Yukawa potential */
        second = exp(-rij * Xi) * ((1.0*(1.0))/rij);
        third = (Xi + 1.0/rij);
        a = second * third;

        #pragma omp atomic
        a_x[ic] = a_x[ic] + (rc_x/rij)*a;
        #pragma omp atomic
        a_x[jc] = a_x[jc] - (rc_x/rij)*a;
        #pragma omp atomic
        a_z[ic] = a_z[ic] + (rc_z/rij)*a;
        #pragma omp atomic
        a_z[jc] = a_z[jc] - (rc_z/rij)*a;

        uVal = second;
        uSum = uSum + second;
    }
}

However, I am not getting the output that the serial version of this fragment produces. There is some deviation that grows over time when the same force is computed repeatedly over a long run.

Where is the mistake that I am making? I tried to use atomic updates of the force to eliminate any race condition that might occur. I am new to parallelization and this is the first project I am trying to parallelize. Can anyone spot the mistake and point me in the right direction?

Edit: I also tried to reduce the two nested for loops to one, but I am unable to reproduce the exact output of the serial version.

  • 2
    What do you mean by exact output? How much is the difference? A small difference (due to rounding errors) can be expected if the order of additions is different... – Laci Sep 02 '22 at 07:15
  • If you do floating point operations in different order, you'll often get different results. – Ted Lyngmo Sep 02 '22 at 07:27
  • There is no data race in your code, but are you sure that there is no loop-carried dependency in your algorithm? An easy (but not complete) check is to run (especially the first) loop (ic) in reverse order. If you do so, do you get the same result? Note that `rc_x` and `rc_z` are private, so there is no need for atomic operations when you change them. I also suggest using your variables at their minimum required scope. Note also that `uVal` is not used in your code. – Laci Sep 02 '22 at 07:48
  • @Laci When I say output I mean that this fragment is part of a while loop. Once the force is calculated, some operations are performed, the positions are updated and time is incremented by some quantity 'dt'. The force is then calculated again for the updated configuration, and the while loop goes on until some time 'tmax' is reached. So over time, the error in the calculated force value increases and some minute fluctuations from the actual value are observed. – Vishal Prajapati Sep 02 '22 at 08:36
  • @TedLyngmo You mean to say that the order in which the operations on the floating-point variables are done could be the reason for the error. What if I used the 'ordered' clause in the '#pragma omp' statement to make the operations execute in a sequential manner? Could that be a feasible solution? – Vishal Prajapati Sep 02 '22 at 08:42
  • Check https://stackoverflow.com/questions/20993327/using-openmp-critical-and-ordered/20996615 – High Performance Mark Sep 02 '22 at 08:45
  • @Laci Also, the 'uVal' value assigned is a global variable which will be used somewhere else in the program... I have not given the complete program, just the loop that I want to parallelize. I have developed a simulation code and now the next step is to parallelize it. As calculating the force between each unique pair of particles is the most time-consuming part of the program, I started with parallelizing this nested for loop first. And this loop itself is giving me a hard time. – Vishal Prajapati Sep 02 '22 at 08:45
  • @VishalPrajapati _"What if I used the 'ordred' clause in `#pragma omp` statement ..."_ - That should make sure that you get the same result every time - but it doesn't guarantee that you get the smallest error. – Ted Lyngmo Sep 02 '22 at 08:51
  • Also, the variables' data types are: int N_TOTAL, ic, jc; double a, rc_x, rc_z, X[], Z[], a_x[], a_z[]; double rij, rr; double second, third; double uVal; – Vishal Prajapati Sep 02 '22 at 08:58
  • https://imgur.com/a/8NXslqU You can view this image to see what I mean. First, the calculated a_x[] and a_z[] values are used to calculate the kinetic energy of the system. Then this kinetic energy is plotted as a function of time. Here is the output: the red curve is the output when the code is run in serial mode and the green curve is the output when the for loops are made parallel. Initially we get accurate results, but as more and more forces are calculated, the small errors add up and become significant. – Vishal Prajapati Sep 02 '22 at 09:04
  • Here the x-axis represents the number of times the force is calculated: 1 means the nested for loop is executed once; 100 means the force is calculated in parallel, some other operations are performed, the force is calculated again, and the process is repeated 100 times. The y-axis is the kinetic energy of the system. – Vishal Prajapati Sep 02 '22 at 09:09
  • @VishalPrajapati To prove a point: Here's an [example](https://godbolt.org/z/PGoaY4n33) where I sum up 1024 random doubles in different order. – Ted Lyngmo Sep 02 '22 at 09:19
  • @TedLyngmo I think I get your point: basically, the order in which floating-point values are added matters significantly. Or I can say that I will have to settle for small rounding errors. Also, as suggested by Laci, I tried to execute the fragment of code with the order of the first loop (ic) reversed; this also causes some variation in the results, even when the code is run in serial without parallelization. So I guess that is due to the order of operations. – Vishal Prajapati Sep 02 '22 at 09:35
  • However, as far as the parallelization is concerned, can it be considered correct if the rounding errors due to the order of floating-point operations are ignored? – Vishal Prajapati Sep 02 '22 at 09:36
  • Also, any suggestions for improving the speed of the code? Atomic operations could be a bottleneck when many iterations or a large number of threads are used. – Vishal Prajapati Sep 02 '22 at 09:38
  • 1
    I haven't analyzed your code in depth, but try to only operate on an isolated part of the data in the inner loops. Having synchronization (atomics, mutex locks etc) will probably slow it down. If you can, prepare the data for each inner loop in such a way that it's cache-friendly and try to avoid false sharing. When I've managed to do that I've gotten almost X times the speed compared to running it in serial, where X is the number of logical processors. I haven't used OMP directly for that though, but the general idea should hold. – Ted Lyngmo Sep 02 '22 at 09:56
  • I think if `uVal` is a global variable and used later in your code, it is definitely a problem. Its value does depend on the execution order. (I do not mean rounding errors.) – Laci Sep 02 '22 at 12:59
  • Can you print out the first calculated value of (`rr`, `uSum`, `uVal`) after the `for` loops using serial, "reversed" serial and parallel code? Please use at least 15 significant digits and update your post and show us these values. – Laci Sep 02 '22 at 13:04

2 Answers

1

Since I cannot comment at the moment, I will answer the additional question from your comment, "any suggestions in improving the speed of the code?"

Move all the atomic operations outside of the inner loop. You can do that by declaring local pointers a_x_local and a_z_local and allocating them (with the same size as a_x and a_z) at the beginning of the outer loop. Since they are private, you no longer need atomic operations to update them in the inner loop. Then, after the inner loop, you update all the elements of a_x and a_z at once (with a critical section or atomic operations).

#pragma omp parallel for schedule(guided, 8) reduction(+:rr, uSum) default(shared) private(ic, jc, rc_x, rc_z, rij, uVal, second, third, a)
for (ic = 0; ic < (N_TOTAL-1); ic++)
{
    double *a_x_local, *a_z_local;
    a_x_local = (double *)malloc(N_TOTAL * sizeof(double));
    a_z_local = (double *)malloc(N_TOTAL * sizeof(double));
    for (jc = 0; jc < N_TOTAL; jc++) {
        a_x_local[jc] = 0.0;
        a_z_local[jc] = 0.0;
    }
    for (jc = ic+1; jc < N_TOTAL; jc++)
    {
        rc_x = X[ic] - X[jc];
        rc_z = Z[ic] - Z[jc];
        /* Minimum image convention */
        if (fabs(rc_x) > LxScaled/2.0)
            rc_x = rc_x - SignR(LxScaled, rc_x);
        if (fabs(rc_z) > LzScaled/2.0)
            rc_z = rc_z - SignR(LzScaled, rc_z);
        rij = sqrt(rc_x * rc_x + rc_z * rc_z);
        rr = rr + (rc_x * rc_x + rc_z * rc_z);
        /* Compute Force : Yukawa potential */
        second = exp(-rij * Xi) * ((1.0*(1.0))/rij);
        third = (Xi + 1.0/rij);
        a = second * third;

        a_x_local[ic] += (rc_x/rij)*a;
        a_x_local[jc] -= (rc_x/rij)*a;
        a_z_local[ic] += (rc_z/rij)*a;
        a_z_local[jc] -= (rc_z/rij)*a;

        uVal = second;
        uSum = uSum + second;
    }
    /* Update the global arrays once per outer iteration */
    #pragma omp critical
    for (jc = 0; jc < N_TOTAL; jc++) {
        a_x[jc] += a_x_local[jc];
        a_z[jc] += a_z_local[jc];
    }
    free(a_x_local);
    free(a_z_local);
}

And as pointed out in the comments, you don't need the atomic pragma to update rc_x and rc_z.

Also, you may omit the schedule clause at first, and test various scheduling options later on.
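For reference, here is a minimal, self-contained sketch of that kind of experiment (not taken from the original code; the sqrt() call and the value of N are just placeholders for the real force kernel and system size). Because the inner loop gets shorter as ic grows, the work per outer iteration is uneven, which is why guided or dynamic schedules are worth comparing against the default static one:

#include <math.h>
#include <stdio.h>

#define N 20000   /* placeholder system size */

int main(void)
{
    double sum = 0.0;

    /* Try schedule(static), schedule(dynamic, 64), schedule(guided, 8), ...
       and compare the wall-clock times (compile with e.g. gcc -O3 -fopenmp). */
    #pragma omp parallel for schedule(guided, 8) reduction(+:sum)
    for (int ic = 0; ic < N - 1; ic++) {
        for (int jc = ic + 1; jc < N; jc++) {
            sum += sqrt((double)(ic + jc));   /* stand-in for the pair force */
        }
    }

    printf("checksum = %f\n", sum);
    return 0;
}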

EDIT: following Laci's answer, I have timed the various versions too.

I have set N_TOTAL=50000

  • original code, compiled without OpenMP (just -O3 with gcc): 20"

(all following compiled with OpenMP)

  • original code, running with 1 thread: 114" (!!)

(all following running with 8 threads)

  • original code: 28"
  • my code with a critical region: 9"
  • my code with atomics instead: 10"
  • Laci's code with atomics: 8"
  • Laci's code with critical region instead: 6"

Observations:

  • the original version runs quite poorly with OpenMP, likely because of the atomics everywhere
  • Laci's solution with a single update of the global arrays at the end of each thread is (as expected) better than my solution with N_TOTAL updates
  • However, a critical region for the whole update loop is faster than atomic operations within the loop (a likely explanation is that atomic breaks the vectorisation).

Regarding atomic versus critical

I have set up a minimal code example (below). With 8 threads the atomic version runs in 16", the critical version in 9". This is not to say that critical is always faster than atomic: it is not, and simply replacing an atomic by a critical section is probably always slower. However, a single critical region that encompasses a loop can be faster than repeated atomic operations within the loop.

atomic version:

#include <math.h>
#include <stdlib.h>

int main(void) {
    int ic, jc, N_TOTAL = 50000;
    double *a_x;
    a_x = (double*)malloc(N_TOTAL*sizeof(double));
    for (jc = 0; jc < N_TOTAL; jc++) a_x[jc] = 0.0;

    #pragma omp parallel private(ic, jc)
    {
        double *a_x_local;
        a_x_local = (double*)malloc(N_TOTAL*sizeof(double));

        #pragma omp for
        for (ic = 0; ic < (N_TOTAL-1); ic++) {
            for (jc = ic+1; jc < N_TOTAL; jc++) {
                a_x_local[jc] = sqrt((double)(ic+jc));
                /* one atomic update per pair */
                #pragma omp atomic
                a_x[jc] += a_x_local[jc];
            }
        }
        free(a_x_local);
    } /* end parallel */

    free(a_x);
    return 0;
} /* end main */

critical version:

#include <math.h>
#include <stdlib.h>

int main(void) {
    int ic, jc, N_TOTAL = 50000;
    double *a_x;
    a_x = (double*)malloc(N_TOTAL*sizeof(double));
    for (jc = 0; jc < N_TOTAL; jc++) a_x[jc] = 0.0;

    #pragma omp parallel private(ic, jc)
    {
        double *a_x_local;
        a_x_local = (double*)malloc(N_TOTAL*sizeof(double));

        #pragma omp for
        for (ic = 0; ic < (N_TOTAL-1); ic++) {
            for (jc = ic+1; jc < N_TOTAL; jc++) {
                a_x_local[jc] = sqrt((double)(ic+jc));
            }
            /* one critical region around the whole update loop */
            #pragma omp critical
            for (jc = ic+1; jc < N_TOTAL; jc++) {
                a_x[jc] += a_x_local[jc];
            }
        }
        free(a_x_local);
    } /* end parallel */

    free(a_x);
    return 0;
} /* end main */
PierU
  • Note that 1) instead of using a critical section, atomic operations are somewhat faster, 2) it is enough to allocate/free the `a_x_local` and `a_z_local` arrays only once per thread (instead of N_TOTAL-1 times in total). As memory allocation is time consuming, it will definitely make the code faster. – Laci Sep 02 '22 at 14:51
  • @Laci Generally speaking you're absolutely right, but I doubt it can make an observable difference in practice on this code. 2 allocations are negligible compared to the inner loop executed N_TOTAL times with many operations (I assume that N_TOTAL is large, otherwise there's almost no point in speeding up the code :)). Regarding one critical region that encompasses the whole update loop versus many atomics for each elemental operation, I'm not sure and I would test it (for instance, what happens to the vectorisation of the loop if there are atomic pragmas inside?). – PierU Sep 02 '22 at 15:10
  • Instead of theoretical discussion, I have measured it. Please check my answer. – Laci Sep 02 '22 at 22:50
  • @Laci OK... I guess this is faster also because you update the global arrays only once per thread (instead of in the inner loop in the original one and in the outer in mine) – PierU Sep 03 '22 at 07:49
  • 1
    As far as the `SignR` function is concerned, it gives the sign (+ or -) of `rc_x` to `LxScaled` and returns the value of `LxScaled` with the sign taken from `rc_x`. And as for `uVal`, I don't need it; I finally excluded `uVal` from the code. Thank you everyone for all your support and help, I am learning from you all. – Vishal Prajapati Sep 03 '22 at 10:44
  • @Laci , I have edited my answer with some timings too, particularly about atomic versus critical – PierU Sep 03 '22 at 15:50
1
  1. As already explained in the comments, if you do floating-point operations in a different order you may get a different result. This is due to rounding errors; for more details please read the following document: What Every Computer Scientist Should Know About Floating-Point Arithmetic. (A short sketch illustrating the effect follows this list.)

  2. There is no race condition in your code, but there is a problem with your global variable uVal. In the parallel region its sharing attribute is private, so if you change it inside the parallel region, it does not change the global value. In your serial code it relates to the last calculated particle pair, which makes no sense to me. You mentioned that you use it in another part of the code. In my opinion, it can easily be calculated when needed; there is no need to compute it in your parallel region.

  3. Variables rc_x and rc_z are private, so there is no need for atomic operations when you change them.

  4. I suggest using your variables in their minimum required scope. It may help the compiler to make better optimizations.

  5. Multiplication is generally faster than division, so you should change your code accordingly, e.g. instead of LxScaled/2.0 use LxScaled*0.5.

  6. You mentioned that you use OpenMP 4.0, which does not have array reduction. It is not a problem, you can do it easily manually.
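To make point 1 concrete, here is a small standalone sketch (the values are arbitrary and unrelated to the simulation): summing the very same numbers forward and backward typically already gives results that differ in the last bits, which is the kind of drift seen between the serial and parallel runs.

#include <stdio.h>

#define N 1024

int main(void)
{
    double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (3.0 * (i + 1));   /* arbitrary values that are not exactly representable */

    double forward = 0.0, backward = 0.0;
    for (int i = 0; i < N; i++)      forward  += x[i];   /* one summation order */
    for (int i = N - 1; i >= 0; i--) backward += x[i];   /* the reverse order   */

    printf("forward  = %.17g\n", forward);
    printf("backward = %.17g\n", backward);
    printf("diff     = %.17g\n", forward - backward);
    return 0;
}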

Putting it all together, your code should look something like this:

#pragma omp parallel default(none) shared(X, Z, a_x, a_z, uSum, rr, Xi, LxScaled, LzScaled, N_TOTAL)
{
    double a_x_local[N_TOTAL];
    double a_z_local[N_TOTAL];
    for (int jc = 0; jc < N_TOTAL; jc++) {
        a_x_local[jc] = 0.0;
        a_z_local[jc] = 0.0;
    }

    #pragma omp for schedule(guided, 8) reduction(+:rr, uSum) 
    for(int ic = 0; ic < (N_TOTAL-1); ic++)
    {
        for (int jc = ic+1; jc < N_TOTAL; jc++)
            { 
                double rc_x = X[ic]-X[jc];
                double rc_z = Z[ic]-Z[jc];
                /* Minimum image convention */
                if (fabs(rc_x) > 0.5*LxScaled)
                    rc_x -=  SignR (LxScaled, rc_x);
                if (fabs(rc_z) > 0.5*LzScaled)
                    rc_z -=  SignR (LzScaled, rc_z);
                const double d_rr=rc_x * rc_x + rc_z * rc_z;
                rr += d_rr;
                const double rij = sqrt(d_rr);
                const double rij_1 = 1.0/rij;           
                /* Compute Force : Yukawa potential*/
                const double second = exp (-rij * Xi )*rij_1;
                const double second_x_third = second * (Xi+rij_1);
                const double d_rc_x = rc_x* rij_1 * second_x_third;
                const double d_rc_z = rc_z* rij_1 * second_x_third;

                a_x_local[ic] += d_rc_x;
                a_x_local[jc] -= d_rc_x; 
                a_z_local[ic] += d_rc_z;
                a_z_local[jc] -= d_rc_z;

                //uVal = second; // should be calculated when needed
                uSum += second;
            }  
    }
    for (int jc = 0; jc < N_TOTAL; jc++) {
        #pragma omp atomic
        a_x[jc] += a_x_local[jc];
        #pragma omp atomic
        a_z[jc] += a_z_local[jc];
    } 
} 
  7. On my laptop this code runs faster than your or @PierU's parallel code (of course, I just used arbitrary data, and I have no idea what your SignR function does; the code is here):
Measured runtimes (N_TOTAL=10000):
Your serial code=0.520492 s
Your parallel code=0.169654 s
PierU's parallel code=0.140338 s
My parallel code=0.092958 s
Laci