
I have this fragment of code, which solves Newton's equations of motion and gives the positions and velocities of the particles for the next step. `deltat_s` and `dtSqr` are constants, and `X[]`, `Z[]`, `V_X[]`, `V_Z[]` are arrays of type double holding the position and velocity values respectively, while `a_x[]` and `a_z[]` hold the acceleration values; X and Z denote the x and z components of the vectors. The fragment basically consists of the two for loops given below. The function `ComputeForces()` calculates the forces and updates the `a_x[]` and `a_z[]` values, and after this update `V_X` and `V_Z` are updated again. `V_X[]` and `V_Z[]` are already updated in the first loop because `ComputeForces()` needs the updated values to calculate `a_x[]` and `a_z[]`.

for(i=0; i<N_TOTAL; i++)
  {
    /* position update: r(t+dt) = r(t) + v(t)*dt + 0.5*a(t)*dt^2,
       plus the first half of the velocity update with the old accelerations */
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
  }

ComputeForces ();   /* recomputes a_x[] and a_z[] from the updated state */

for(i=0; i<N_TOTAL; i++)
  {
    /* second half of the velocity update with the new accelerations */
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
  }

I have parallelized the code as follows:

#pragma omp parallel for ordered schedule (guided, 4) default(shared) private(i)
for(i=0; i<N_TOTAL; i++)
      {
        X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
        V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
        Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
        V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
      }

    ComputeForces ();

    #pragma omp parallel for schedule (guided, 16) default(shared) private(i)
    for(i=0; i<N_TOTAL; i++)
      {
        V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
        V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
      }

The issue I am facing is that when I parallelize the loops, the code takes more time to execute than the serial version. The times I obtained are as follows.

serial code : 3.546000e-02

parallel code : 4.632966e-02

It is a slight difference, but when the values are updated over a longer duration, in other words when the loops are executed, say, 1 X 10^6 times, this small increase in time adds up and slows the code down. I am unable to find the issue. I guess there could be false sharing of the `V_X[]` and `V_Z[]` data. I have tried changing the chunk size to reduce false sharing and I have tried different scheduling, but I am unable to make it faster than the serial code.

Edit:

To check this, I have done a run of 30000 steps. In each step the above code is evaluated once, followed by a few other calculations in serial. The times taken are as follows: serial code : 1.102543e+03 and parallel code : 1.363923e+03. This is the problem I was talking about.

The processor I am using is an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz X 40, and the value of `N_TOTAL` is 1000.

  • To check this, I have done a run of 30000 steps. In each step the above code is evaluated once, followed by a few other calculations in serial. The times taken are as follows: serial code : 1.102543e+03 and parallel code : 1.363923e+03. This is the problem I was talking about. – Vishal Prajapati Sep 06 '22 at 10:28
  • Can you specify the processor used to run this code? How big is `N_TOTAL` in practice? Why did you use the clause `ordered`? – Jérôme Richard Sep 06 '22 at 10:31
  • @JérômeRichard I have used `ordered` because `X[]` must be calculated before calculating `V_X[]`; the same goes for `Z[]` and `V_Z[]`. – Vishal Prajapati Sep 06 '22 at 10:40
  • I have edited the question with the useful information. – Vishal Prajapati Sep 06 '22 at 10:41
  • Thank you for the information. There is no need for the `ordered` clause to specify that: each iteration is sequenced in-order (though possibly executed out-of-order by the processor, which respects the sequential consistency of each iteration). What is the `X 40` at the end of the processor specification supposed to mean? Do you have one or multiple sockets? (With 1 socket you should have 10 cores and 20 threads.) If you are unsure, you can provide the result of the `lstopo` command. – Jérôme Richard Sep 06 '22 at 10:48
  • @JérômeRichard Yes, I have multiple sockets, 2 to be precise, giving 20 cores and 40 threads. However, the runs related to my question used only 8 threads. – Vishal Prajapati Sep 06 '22 at 10:52
  • With 1000 iterations and 8 threads each thread does a good 120 operations, which is 30 AVX operations, so 30 cycles. The barrier at the end of an OMP loop can easily take a few hundred cycles. 2. Please learn an editor that does indentation. At first I thought `ComputeForces` was inside the loop. 3. Declare your loop variables inside the loop: `for (int i=whatever` so you don't have to worry about making them private. It's even better coding style. – Victor Eijkhout Sep 06 '22 at 13:25
  • @VictorEijkhout Just a minor point: the target Broadwell processor can execute [2 AVX FMA instructions per cycle](https://stackoverflow.com/questions/15655835) (like most mainstream desktop/server processors). The loop should do about 6 instructions per iteration (not one). So it should be something like 180 AVX operations and 90 cycles (+ a latency of <10 cycles and a lower frequency due to AVX being used). It is still very small though. – Jérôme Richard Sep 11 '22 at 10:32

2 Answers


There are several issues with the parallelization method.

First of all, `schedule(guided, 4)` can cause false-sharing issues due to cache lines shared between threads. Indeed, a double-precision (DP) floating-point (FP) number takes 8 bytes in memory and a cache line is 64 bytes on your target processor (not to mention that some Intel processors prefetch data in 128-byte pairs of cache lines), so one cache line holds 8 doubles and 4-element chunks belonging to different threads can end up in the same line. Using a bigger scheduling grain, like >= 16, should help.
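
For illustration, here is a minimal sketch of the first loop with a plain static schedule, keeping the arithmetic exactly as in the question (only the directive changes); with a static schedule each thread gets one large contiguous chunk, so cache lines can only be shared at chunk boundaries:

#pragma omp parallel for schedule(static) default(shared)
for (int i = 0; i < N_TOTAL; i++)
  {
    /* same body as the serial version */
    X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
  }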

Moreover, a guided schedule causes issues on NUMA systems like your target platform. Put shortly, each processor has its own RAM, and while the other processors of the same computing node can access that RAM, remote accesses are more expensive and can be imbalanced. Indeed, a guided schedule causes cores to perform non-local accesses in a non-deterministic way. In the worst case, all the data may lie in the same NUMA node, causing one NUMA node to be saturated and the other to be unused. The NUMA policy must be carefully controlled to mitigate/avoid NUMA effects. Since your code is memory bound, this effect can make the parallelization completely useless (using more cores when the RAM is saturated does not make a code run faster). It is better to use a static schedule with the data well distributed amongst the NUMA nodes. A common solution is to fill the arrays in parallel during the initialization so that the first-touch policy (generally the default one) distributes the memory pages properly (assuming the schedule clause is the same). Alternatively, you can use numactl. You can find more information in this post and this one. Besides, note that binding threads often has a significant impact on performance in this context (using 4 cores on each of 2 NUMA nodes should result in better performance than 8 cores lying on the same NUMA node, assuming your application supports NUMA systems well).
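
A minimal sketch of such a first-touch initialization, assuming the arrays are ordinary heap or static allocations and that the same static schedule is used here as in the update loops (the zero initial values are just placeholders):

/* Each page is first written, and thus physically allocated, on the NUMA
   node of the thread that will later update the same index range. */
#pragma omp parallel for schedule(static) default(shared)
for (int i = 0; i < N_TOTAL; i++)
  {
    X[i]   = 0.0;  Z[i]   = 0.0;   /* placeholder initial values */
    V_X[i] = 0.0;  V_Z[i] = 0.0;
    a_x[i] = 0.0;  a_z[i] = 0.0;
  }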

Finally, the number of iterations is certainly too small for multiple threads to be useful. Indeed, the loop can be vectorized using the AVX and FMA3 instruction sets, which can compute 4 double-precision numbers per instruction. The target processor can execute 2 FMA instructions per cycle. This means 8 DP FP numbers per cycle. Thus, I expect the computation time to be clearly less than a microsecond. Most of the time should be taken by the memory accesses. This is especially true in parallel, since the guided schedule causes data to move between the caches of different cores rather than staying in the L1/L2 caches (which are significantly faster than the shared L3 cache). Moreover, distributing the work between cores takes some time (especially when the target cores lie on another processor). One should also note that each `parallel for` ends with an implicit barrier that also adds some overhead. All these overheads are generally significantly more than 1 us (this is very dependent on the target machine and OpenMP runtime though).
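
One way to reduce part of this overhead is to open the parallel region only once around both loops, so the threads are forked/joined once per step instead of twice (the barriers after each work-sharing loop remain, as they are needed for correctness). This is only a sketch, and it assumes `ComputeForces()` can safely be executed by a single thread inside the parallel region; if it spawns its own parallel loops, this fusion is not directly applicable:

#pragma omp parallel default(shared)
{
  #pragma omp for schedule(static)
  for (int i = 0; i < N_TOTAL; i++)
    {
      X[i] = X[i] + deltat_s * V_X[i] + 0.5 * dtSqr * a_x[i];
      V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
      Z[i] = Z[i] + deltat_s * V_Z[i] + 0.5 * dtSqr * a_z[i];
      V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
    }   /* implicit barrier: all positions and half-velocities are done */

  #pragma omp single
  ComputeForces ();   /* one thread updates a_x[], a_z[]; implicit barrier after */

  #pragma omp for schedule(static)
  for (int i = 0; i < N_TOTAL; i++)
    {
      V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
      V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
    }
}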

Note that AVX/AVX2 is only used if your code is compiled with -march=native or -mavx or similar flags. By default, compilers use older instruction sets which are slower but compatible with (much) older machines.
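
For example, with GCC the compilation command would look something like the following (the source file name is just a placeholder):

gcc -O3 -march=native -fopenmp md.c -o md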

Jérôme Richard
  • I think your discussion about the guided schedule is, while totally correct, somewhat beside the point: you can just say "there is no reason to use anything but the default static schedule". – Victor Eijkhout Sep 06 '22 at 13:20
  • @VictorEijkhout Indeed, the answer is a bit long. I was unsure about the intent of the OP regarding the guided schedule. Such a schedule is generally used to mitigate a load imbalance, which can be caused by NUMA effects (but it is not a good idea to use it in that case). – Jérôme Richard Sep 11 '22 at 10:19

There are two very likely explanations for your observation:

  • The amount of calculation within each parallel region is too small (a loop with only a few simple operations, even executed 30000 times, is nothing nowadays). OpenMP has some overheads that can cancel the benefits of a parallel execution in such cases (see the sketch after this list).
  • Your code is memory bound, or it becomes memory bound once parallelized on 8 threads. "Memory bound" means that the bottleneck is not the CPU load, but the data transfer rate between the memory (RAM) and the CPU. In that case no performance improvement can be obtained with parallelization. When working on arrays, a code is memory bound when it performs only a few simple operations per array element.
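
Regarding the first point, one simple safeguard is OpenMP's `if` clause, which makes the loop run serially when the trip count is too small to amortize the parallel overhead. This is only a sketch, and the threshold value is an illustrative guess that would need to be tuned by measurement:

/* The loop runs in parallel only when there is enough work to share;
   otherwise it executes serially in the encountering thread. */
#pragma omp parallel for schedule(static) if(N_TOTAL > 10000)
for (int i = 0; i < N_TOTAL; i++)
  {
    V_X[i] = V_X[i] + 0.5 * deltat_s * a_x[i];
    V_Z[i] = V_Z[i] + 0.5 * deltat_s * a_z[i];
  }
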
PierU