
I've got a problem with OpenMP. I know that if you increment something in a parallel block you have to put an atomic before that expression. But there is a part of my code I don't understand.

Why do I have to use the atomic here?

#pragma omp parallel
{
  double distance, magnitude, factor, r;
  vector_t direction;
  int i, j;
#pragma omp for
  for (i = 0; i < n_body - 1; i++)
    {
      for (j = i + 1; j < n_body; j++)
        {
          r = SQR (bodies[i].position.x - bodies[j].position.x) + SQR (bodies[i].position.y - bodies[j].position.y);
          // avoid numerical instabilities
          if (r < EPSILON)
            {
              // this is not how nature works :-)
              r += EPSILON;
            }
          distance = sqrt (r);
          magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);

          factor = magnitude / distance;
          direction.x = bodies[j].position.x - bodies[i].position.x;
          direction.y = bodies[j].position.y - bodies[i].position.y;

          // +force for body i
          #pragma omp atomic
          bodies[i].force.x += factor * direction.x;
          #pragma omp atomic
          bodies[i].force.y += factor * direction.y;

          // -force for body j
          #pragma omp atomic
          bodies[j].force.x -= factor * direction.x;
          #pragma omp atomic
          bodies[j].force.y -= factor * direction.y;
        }
    }
}

And why don't I have to use it here:

#pragma omp parallel
{
  vector_t delta_v, delta_p;
  int i;

#pragma omp for
  for (i = 0; i < n_body; i++)
    {
      // calculate delta_v
      delta_v.x = bodies[i].force.x / bodies[i].mass * dt;
      delta_v.y = bodies[i].force.y / bodies[i].mass * dt;

      // calculate delta_p
      delta_p.x = (bodies[i].velocity.x + delta_v.x / 2.0) * dt;
      delta_p.y = (bodies[i].velocity.y + delta_v.y / 2.0) * dt;

       // update body velocity and position
      bodies[i].velocity.x += delta_v.x;
      bodies[i].velocity.y += delta_v.y;
      bodies[i].position.x += delta_p.x;
      bodies[i].position.y += delta_p.y;

      // reset forces
      bodies[i].force.x = bodies[i].force.y = 0.0;


      if (bounce)
        {
          // bounce on boundaries (i.e. it's more like billiard)
          if ((bodies[i].position.x < -body_distance_factor) || (bodies[i].position.x > body_distance_factor))
            bodies[i].velocity.x = -bodies[i].velocity.x;
          if ((bodies[i].position.y < -body_distance_factor) || (bodies[i].position.y > body_distance_factor))
            bodies[i].velocity.y = -bodies[i].velocity.y;
        }
    }
}

The code works as it is now, but I simply don't understand why. Can you help me?

Kind regards, Michael

  • You might consider exploiting Newton's 3rd law and/or getting rid of the atomic construct, as it is possibly expensive (an FP FMA isn't a HW atomic). Look at the LAMMPS OpenMP force kernels for examples. You might consider looping over i<=j and then doing a second i>j loop to set f(i,j) = -f(j,i). Other possibilities exist. – Jeff Hammond Sep 19 '13 at 14:38
  • btw, why not "#pragma omp for" for the inner loop? – kchoi Sep 20 '13 at 20:06
  • ah nevermind. "#pragma omp for collapse(<# level for loops>)" http://bisqwit.iki.fi/story/howto/openmp/#BarrierDirectiveAndTheNowaitClause – kchoi Sep 20 '13 at 20:28
  • I don't think collapse will work since `j` depends on `i`. – Z boson Oct 31 '13 at 14:49
  • Okay, I finished my edits. Check out the code. It probably still has a few bugs, but I hope you can get the main ideas from it. – Z boson Oct 31 '13 at 15:55
  • This is not an answer to your question (which has already been given), but you might still consider it: first, for the calculation of r you take the differences of the x and y positions, and a few lines later you compute the same differences again for the direction. Calculate the direction first and use it for the calculation of r. Second, in your magnitude calculation you divide by distance * distance, which is just r (a short sketch follows below this list). – el.mojito Oct 31 '13 at 13:23
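A sketch of el.mojito's suggestion, reusing the variables from the question (SQR, EPSILON and G as defined in the original code):

direction.x = bodies[j].position.x - bodies[i].position.x;
direction.y = bodies[j].position.y - bodies[i].position.y;
r = SQR (direction.x) + SQR (direction.y);
// avoid numerical instabilities
if (r < EPSILON)
  r += EPSILON;
// distance * distance is just r, so the square root is only needed once, for factor
magnitude = (G * bodies[i].mass * bodies[j].mass) / r;
factor = magnitude / sqrt (r);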

2 Answers


In the second of the two code samples, each parallel iteration of the loop works on element [i] of the array and never touches any neighbouring elements. Thus no iteration of the loop has any effect on another iteration, and they can all be executed at the same time without worry.

In the first code example, however, each parallel iteration of the loop may read from and write to anywhere in the bodies array through the index [j]. This means that two threads could try to update the same memory location at the same time, or one thread could write to a location that another one is reading. To avoid these race conditions you need to ensure that the writes are atomic.
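Here is a minimal, self-contained illustration of that kind of race (not taken from the original program): with the atomic the result is always 1000000, without it the read-modify-write of sum can interleave between threads, updates get lost, and the printed value is typically smaller.

#include <stdio.h>
#include <omp.h>

int main (void)
{
  double sum = 0.0;
#pragma omp parallel for
  for (int i = 0; i < 1000000; i++)
    {
      // read-modify-write of a shared variable: needs the atomic
#pragma omp atomic
      sum += 1.0;
    }
  printf ("%f\n", sum);
  return 0;
}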

clj
  • While the outer loop [i] may not clash, the inner loop [j] may overlap, since it iterates over [i+1, n_body) for every i – kchoi Sep 20 '13 at 19:43

When multiple threads can write to the same memory location you need an atomic operation or a critical section to prevent a race condition. Atomic operations are faster but more restricted (they only work on plain scalar types with a few basic operators), and in your case they are sufficient.
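As a quick sketch on the variables from your first loop, the same update could be protected either way; the critical version is shown only to illustrate the more general (and usually slower) alternative:

// atomic: protects exactly one scalar update of the form x op= expr
#pragma omp atomic
bodies[j].force.x -= factor * direction.x;

// critical: protects an arbitrary block, one thread at a time
#pragma omp critical
{
  bodies[j].force.x -= factor * direction.x;
  bodies[j].force.y -= factor * direction.y;
}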

So you have to ask yourself when threads may write to the same memory location. In the first case you only parallelize the outer loop over i, not the inner loop over j, so you don't actually need the atomic operators on the bodies[i] updates, only on the bodies[j] ones.

Let's consider an example from the first case. Assume n_body = 101 and 4 threads; with schedule(static) (typically the default) the 100 outer iterations are split into four contiguous chunks:

Thread one   i =  0-24, j =  1-100, j range = 100
Thread two   i = 25-49, j = 26-100, j range = 75
Thread three i = 50-74, j = 51-100, j range = 50
Thread four  i = 75-99, j = 76-100, j range = 25

First of all, you see that the threads write to some of the same memory locations: for example, all four threads write to the locations with j = 76-100. That's why you need the atomic operator on the j updates. However, no two threads write to the same memory location through i. That's why you don't need an atomic operator on the i updates.

In your second case there is only one loop and it's parallelized over i, so no two threads ever write to the same memory location and you don't need any atomic operators.

That answers your question, but here are some additional comments to improve the performance of your code:

There is another important observation, independent of the atomic operators. You can see that thread one covers a j range of up to 100 values while thread four covers at most 25, so thread one does far more of the inner iterations. Therefore, the load is not well balanced with schedule(static), which is typically the default schedule. For larger values of n_body this gets worse.

One solution is to try schedule(guided). I have not used it before, but I think it's the right tool here (see OpenMP: for schedule): "A special kind of dynamic scheduling is the guided where smaller and smaller iteration blocks are given to each task as the work progresses." According to the standard, with guided each successive chunk gets roughly number_of_iterations_remaining / number_of_threads iterations. So for our example that gives

Thread one   i =  0-24, j =  1-100, j range = 100
Thread two   i = 25-44, j = 26-100, j range = 75
Thread three i = 45-59, j = 46-100, j range = 55
Thread four  i = 60-69, j = 61-100, j range = 40
Thread one   i = 70-76, j = 71-100
...

Notice that the work is now more evenly distributed over the threads. With static scheduling the fourth thread covered a j range of only 25, while with guided scheduling it covers a range of 40.
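Switching schedules only means changing the clause on the worksharing loop; as a sketch (with schedule(runtime) the schedule can even be picked at run time through the OMP_SCHEDULE environment variable):

#pragma omp for schedule(guided)   // shrinking chunks, good for this triangular loop
for(int i=0; i<n_body-1; i++) { /* ... inner j loop as before ... */ }

#pragma omp for schedule(runtime)  // schedule chosen at run time via OMP_SCHEDULE
for(int i=0; i<n_body-1; i++) { /* ... inner j loop as before ... */ }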

But let's look at your algorithm more carefully. In the first case you are calculating the gravitational force on each body. Here are two ways you can do this:

//method1 
#pragma omp parallel for
for(int i=0; i<n_body; i++) {
    vector_t force = {0.0, 0.0};
    for(int j=0; j<n_body; j++) {
        if(j == i) continue;               // a body exerts no force on itself
        vector_t f = gravity_force(i, j);  // force on body i due to body j
        force.x += f.x;
        force.y += f.y;
    }
    bodies[i].force = force;
}

But by Newton's third law gravity_force(i,j) = -gravity_force(j,i), so you don't need to calculate the pair force twice. That's why you found a faster solution:

//method2 (see your code and mine below on how to parallelize this)
for(int i=0; i<n_body-1; i++) {
    for(int j=i+1; j<n_body; j++) {
        vector_t f = gravity_force(i, j);  // compute the pair force only once
        bodies[i].force.x += f.x;          // +force on body i
        bodies[i].force.y += f.y;
        bodies[j].force.x -= f.x;          // -force on body j (Newton's third law)
        bodies[j].force.y -= f.y;
    }
}

The first method does n_body*n_body iterations and the second does (n_body-1)*n_body/2 iterations, which to first order is n_body*n_body/2. However, the second method is much harder to parallelize efficiently (see your code and my code below): it needs atomic operations and its load is not balanced. The first method does twice as many iterations, but the load is evenly distributed and it needs no atomic operations. You should test both methods to see which one is faster.
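For example with something along these lines, assuming you wrap the two variants into functions compute_forces_method1 and compute_forces_method2 (hypothetical names) and use omp_get_wtime from omp.h:

double t0 = omp_get_wtime();
compute_forces_method1(bodies, n_body);   // method 1: full n_body x n_body loop
double t1 = omp_get_wtime();
compute_forces_method2(bodies, n_body);   // method 2: triangular loop with atomics
double t2 = omp_get_wtime();
printf("method1: %f s, method2: %f s\n", t1 - t0, t2 - t1);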

To parallelize the second method you can do what you did:

#pragma omp parallel for schedule(static) // you should try schedule(guided)
for(int i=0; i<n_body-1; i++) {
    for(int j=i+1; j<n_body; j++) {
        vector_t f = gravity_force(i, j);
        // no atomic necessary on i: each i is handled by exactly one thread
        bodies[i].force.x += f.x;
        bodies[i].force.y += f.y;
        // but necessary on j: several threads can update the same j
        #pragma omp atomic
        bodies[j].force.x -= f.x;
        #pragma omp atomic
        bodies[j].force.y -= f.y;
    }
}

A better solution is to give each thread a private copy of the force array, like this:

#pragma omp parallel 
{
    // every thread gets its own private force accumulator
    vector_t *force = new vector_t[n_body];
    for(int i=0; i<n_body; i++) force[i].x = force[i].y = 0.0;

    #pragma omp for schedule(guided)
    for(int i=0; i<n_body-1; i++) {
        for(int j=i+1; j<n_body; j++) {
            vector_t f = gravity_force(i, j);
            force[i].x += f.x;   // private array: no atomics needed here
            force[i].y += f.y;
            force[j].x -= f.x;
            force[j].y -= f.y;
        }
    }

    // merge the private arrays into the shared one; every thread walks the
    // full range, so the updates to bodies[i].force must be atomic
    for(int i=0; i<n_body; i++) {
        #pragma omp atomic
        bodies[i].force.x += force[i].x;
        #pragma omp atomic
        bodies[i].force.y += force[i].y;
    }
    delete[] force;
}
Z boson