
So I have this function that I have to parallelize with OpenMP static scheduling for n threads:

void computeAccelerations(){
int i,j;
for(i=0;i<bodies;i++){
    accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
    for(j=0;j<bodies;j++){
        if(i!=j){
            //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
            vector sij = {positions[i].x-positions[j].x,positions[i].y-positions[j].y,positions[i].z-positions[j].z};
            vector sji = {positions[j].x-positions[i].x,positions[j].y-positions[i].y,positions[j].z-positions[i].z};
            double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant*masses[j]/mod3;
            vector S = {s*sji.x,s*sji.y,s*sji.z};
            accelerations[i].x+=S.x;accelerations[i].y+=S.y;accelerations[i].z+=S.z;
        }
    }
}

}

I tried to do something like:

void computeAccelerations_static(int num_of_threads){
int i,j;
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for(i=0;i<bodies;i++){
    accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
    for(j=0;j<bodies;j++){
        if(i!=j){
            //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
            vector sij = {positions[i].x-positions[j].x,positions[i].y-positions[j].y,positions[i].z-positions[j].z};
            vector sji = {positions[j].x-positions[i].x,positions[j].y-positions[i].y,positions[j].z-positions[i].z};
            double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant*masses[j]/mod3;
            vector S = {s*sji.x,s*sji.y,s*sji.z};
            accelerations[i].x+=S.x;accelerations[i].y+=S.y;accelerations[i].z+=S.z;
        }
    }
}

}

It seemed natural to just add `#pragma omp parallel for num_threads(num_of_threads) schedule(static)`, but it isn't correct.

I think there is some kind of false sharing with `accelerations[i]`, but I don't know how to approach it. I appreciate any kind of help. Thank you.

Chris Costa
  • Can you be more specific than "it isn't correct"? Do you mean that the wrong results are computed? – John Bollinger Feb 15 '23 at 20:13
  • Does this answer your question? [How does OpenMP handle nested loops?](https://stackoverflow.com/questions/13357065/how-does-openmp-handle-nested-loops) – Jorge López Feb 15 '23 at 20:19
  • Note that "false sharing" is when multiple threads read/write the same shared cache line while accessing different variables located near each other (i.e. sharing that cache line). False sharing should *never impact results* but tends to decrease performance. This should not be confused with *race conditions* (e.g. when multiple threads read/write the same variable in parallel): race conditions *can* produce wrong results (they cause *undefined behaviour* and often hurt performance too). – Jérôme Richard Feb 15 '23 at 22:16
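For illustration of that distinction, here is a minimal sketch (the names `race_vs_false_sharing`, `total`, and `per_thread` are made up for this example, not from the post): the first loop is a data race, because every thread does an unsynchronized read-modify-write on the same shared variable; the second is correct, but can suffer false sharing, because each thread writes a different element of the same cache line.

#include <omp.h>

#define NTHREADS 4

void race_vs_false_sharing(void) {
    long total = 0;
    long per_thread[NTHREADS] = {0};

    // Data race: every thread updates the same shared variable without
    // synchronization, so the final value of total is unpredictable.
    #pragma omp parallel for num_threads(NTHREADS)
    for (int i = 0; i < 1000000; i++) {
        total++;
    }

    // False sharing: each thread writes only its own element, so the result
    // is correct, but the elements share a cache line, which can hurt speed.
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long k = 0; k < 1000000; k++) {
            per_thread[t]++;
        }
    }
}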

1 Answer


In your loop nest, only the iterations of the outer loop are parallelized. Because `i` is the control variable of the parallelized loop, each thread gets its own private copy, but as a matter of style it would be better to declare `i` in the loop control block.

`j` is another matter. It is declared outside the parallel region and it is not the control variable of a parallelized loop, so it is shared among the threads. Because each of the threads executing `i`-loop iterations manipulates the shared variable `j`, you have a huge problem with data races. This can be resolved (among other alternatives) by moving the declaration of `j` into the parallel region, preferably into the control block of its associated loop.
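
One of those other alternatives, for illustration, is to keep the original declarations and privatize `j` explicitly with a `private(j)` clause (a sketch only; the loop body is unchanged from your version):

// i is the loop control variable of the parallelized loop, so it is
// implicitly private; j must be privatized explicitly.
int i, j;
#pragma omp parallel for private(j) num_threads(num_of_threads) schedule(static)
for (i = 0; i < bodies; i++) {
    accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
    for (j = 0; j < bodies; j++) {
        // ... same body as above ...
    }
}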

Overall, then:

// int i, j;
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;

    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            vector sji = { positions[j].x - positions[i].x,
                           positions[j].y - positions[i].y,
                           positions[j].z - positions[i].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            vector S = { s * sji.x, s * sji.y, s * sji.z };
            accelerations[i].x += S.x;
            accelerations[i].y += S.y;
            accelerations[i].z += S.z;
        }
    }
}

Note also that computing `sji` appears to be wasteful, as in mathematical terms it is just `-sij`, and neither `sji` nor `sij` is modified. I would probably reduce the above to something more like this:

#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;

    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;

            accelerations[i].x -= s * sij.x;
            accelerations[i].y -= s * sij.y;
            accelerations[i].z -= s * sij.z;
        }
    }
}
John Bollinger
  • Great. Note that `GravConstant * masses[j]` can also be precomputed for all `j`. The input data layout can be modified so as to be more SIMD-friendly: a structure of 3 arrays rather than an array of 3-field structures. The condition can be removed using loop splitting, but a good compiler should be able to do that. Finally, tiling can be used to make the inner loop more cache friendly. Moving `accelerations[i].xxx -= ...` outside the loop and using a local accumulator would certainly be a good idea to prevent any false sharing and help the compiler (though it should be able to optimize that anyway). – Jérôme Richard Feb 15 '23 at 22:10
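
A sketch of the local-accumulator and precomputation suggestions from that comment (the `ax`/`ay`/`az` temporaries and the `gm` array are illustrative assumptions, with `gm[j]` assumed to hold `GravConstant * masses[j]`, filled beforehand):

#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    // Accumulate into thread-local scalars; accelerations[i] is written once
    // per outer iteration, avoiding repeated stores to the shared array.
    double ax = 0.0, ay = 0.0, az = 0.0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            double dx = positions[i].x - positions[j].x;
            double dy = positions[i].y - positions[j].y;
            double dz = positions[i].z - positions[j].z;
            double mod = sqrt(dx * dx + dy * dy + dz * dz);
            double s = gm[j] / (mod * mod * mod);  // gm[j] = GravConstant * masses[j], precomputed
            ax -= s * dx;
            ay -= s * dy;
            az -= s * dz;
        }
    }
    accelerations[i].x = ax;
    accelerations[i].y = ay;
    accelerations[i].z = az;
}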