I have a fragment of code that integrates Newton's equations of motion and gives the positions and velocities of the particles at the next step. deltat_s and dtSqr are constants. X[], Z[], V_X[], and V_Z[] are double arrays holding the positions and velocities respectively, and a_x[] and a_z[] hold the accelerations; X and Z denote the x and z components of each vector. The fragment consists of two for loops, shown below. The function ComputeForces() calculates the forces and updates a_x[] and a_z[]; after that update, V_X[] and V_Z[] are advanced again. V_X[] and V_Z[] are updated in the first loop because ComputeForces() needs the updated values to calculate a_x[] and a_z[].
for(i=0; i<N_TOTAL; i++)
{
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
ComputeForces();
for(i=0; i<N_TOTAL; i++)
{
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
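For anyone who wants to run the fragment on its own, here is a minimal self-contained sketch of one such step. The force model inside ComputeForces() (a unit harmonic spring, a = -x) is only a stand-in assumption, since the real force routine is not shown in this post. The time step value, the initial conditions, and the helper names init_particles(), verlet_step(), and total_energy() are likewise hypothetical and chosen purely for illustration.

```c
#define N_TOTAL 1000                      /* particle count quoted in this post */

/* Assumed values: the post only says these two are constants. */
static const double deltat_s = 1.0e-3;
static const double dtSqr    = 1.0e-6;    /* assumed to be deltat_s squared */

static double X[N_TOTAL],   Z[N_TOTAL];   /* positions     */
static double V_X[N_TOTAL], V_Z[N_TOTAL]; /* velocities    */
static double a_x[N_TOTAL], a_z[N_TOTAL]; /* accelerations */

/* Hypothetical stand-in for the real force routine:
 * every particle sits on a unit harmonic spring, so a = -x. */
void ComputeForces(void)
{
    int i;
    for (i = 0; i < N_TOTAL; i++) {
        a_x[i] = -X[i];
        a_z[i] = -Z[i];
    }
}

/* Hypothetical helper: start at rest with a fixed displacement
 * and make the accelerations consistent with the positions. */
void init_particles(void)
{
    int i;
    for (i = 0; i < N_TOTAL; i++) {
        X[i] = 1.0;  Z[i] = 0.5;
        V_X[i] = 0.0; V_Z[i] = 0.0;
    }
    ComputeForces();
}

/* One integration step, following the two loops from the post. */
void verlet_step(void)
{
    int i;
    /* Full position update and first half of the velocity update. */
    for (i = 0; i < N_TOTAL; i++) {
        X[i]   = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
        V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
        Z[i]   = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
        V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
    }
    ComputeForces();   /* accelerations at the new positions */
    /* Second half of the velocity update with the new accelerations. */
    for (i = 0; i < N_TOTAL; i++) {
        V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
        V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
    }
}

/* Hypothetical helper: kinetic plus spring energy (unit mass),
 * only meaningful for the stand-in force model above. */
double total_energy(void)
{
    int i;
    double e = 0.0;
    for (i = 0; i < N_TOTAL; i++) {
        e += 0.5 * (V_X[i] * V_X[i] + V_Z[i] * V_Z[i]);
        e += 0.5 * (X[i] * X[i] + Z[i] * Z[i]);
    }
    return e;
}
```

Calling init_particles() once and then verlet_step() in a loop should keep total_energy() nearly constant for this stand-in force, which is a quick sanity check that the serial and parallel versions compute the same thing.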
I have parallelized the code as follows:
#pragma omp parallel for ordered schedule (guided, 4) default(shared) private(i)
for(i=0; i<N_TOTAL; i++)
{
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
ComputeForces();
#pragma omp parallel for schedule (guided, 16) default(shared) private(i)
for(i=0; i<N_TOTAL; i++)
{
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
}
The issue I am facing is that the parallelized loops take more time to execute than the serial code. The times I obtained are as follows:
serial code : 3.546000e-02
parallel code : 4.632966e-02
It is a small difference, but the values are updated over a long run: when the loops are executed, say, 10^6 times, this small increase adds up and slows the code down. I am unable to find the cause. I suspect there could be false sharing of the V_X[] and V_Z[] data. I have tried changing the chunk size to reduce false sharing, and I have tried different scheduling, but I have been unable to make it faster than the serial code.
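To make the comparison easier to reproduce, a minimal timing sketch along the following lines could be used. It is not the harness that produced the numbers above: the helper names wall_time() and time_update_loop(), the values of deltat_s and dtSqr, and the reduction to the x component only are all assumptions for illustration. The pragma mirrors the first parallel loop, with an added if() clause so the same loop can be run with or without threads.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define N_TOTAL 1000                     /* particle count quoted in this post */

static const double deltat_s = 1.0e-3;   /* assumed value */
static const double dtSqr    = 1.0e-6;   /* assumed to be deltat_s squared */

static double X[N_TOTAL], V_X[N_TOTAL], a_x[N_TOTAL];

/* Hypothetical helper: wall-clock seconds from omp_get_wtime() when built
 * with OpenMP, otherwise CPU seconds from clock() as a fallback. */
double wall_time(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical helper: average seconds per pass over the first update loop.
 * use_threads == 0 makes the if() clause run the region on one thread. */
double time_update_loop(int use_threads, int repeats)
{
    int i, r;
    double t0 = wall_time();
    for (r = 0; r < repeats; r++) {
        #pragma omp parallel for ordered schedule(guided, 4) \
                default(shared) private(i) if(use_threads)
        for (i = 0; i < N_TOTAL; i++) {
            X[i]   = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
            V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
        }
    }
    return (wall_time() - t0) / repeats;
}
```

Comparing time_update_loop(0, repeats) with time_update_loop(1, repeats) for a large repeats value gives the average cost of one pass with and without threads. With GCC the OpenMP pragmas only take effect when the file is compiled with -fopenmp.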
Edit:
To check this, I ran 30000 steps. In each step the above code is evaluated once, followed by a few other calculations in serial. The times taken are: serial code: 1.102543e+03 and parallel code: 1.363923e+03. This is the problem I was talking about.
The processor I am using is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz × 40, and N_TOTAL = 1000.