0

I am attempting to use OpenMP in my project, a simulation containing N agents. Every agent has a position vector. Inside my program, I do something like the following:

for (int i = 0; i < agents.size(); i++) {
    for (int j = 0; j < agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // do a lot of calculations for the interaction between i and j,
            // and a bunch of changes to variables of the Agents stored in the list
        }
    }
}

Can I simply put "#pragma omp parallel for" in front of the outer loop to distribute the work? If I do, my program runs much faster, but I am concerned that the calculations it performs may not be accurate.

The order of these calculations does not matter, as long as every (i, j) pair is processed. Nothing outside the variables defined inside the innermost loop is ever modified, and the member variables of my Agent class are only read, never written.

Not being a systems guy, I have trouble understanding the inner workings of OpenMP, and I have become slightly suspicious of it. Thanks for any help with this!

karpathy

3 Answers

2

Yes, adding the pragma would be correct. Memory is implicitly shared by all threads. Still, that does not mean it will automatically run faster. There are multiple issues to consider:

  • How many processors does your system have?
  • Do you use integers or floating point? For integers the order of operations does not matter, but for floating point it does.
  • Which variables are accessed only by the innermost loop? You could declare them private to obtain better performance (see the sketch after this list).
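
As a concrete illustration of the last point, here is a minimal sketch based on the loop from the question (the temporary d is made up for illustration): in C++, a variable declared inside the parallel loop is automatically private to each thread, which has the same effect as naming it in a private() clause.

#pragma omp parallel for
for (int i = 0; i < agents.size(); i++) {
    for (int j = 0; j < agents.size(); j++) {
        if (i == j) continue;
        // d is declared inside the loop, so every thread gets its own copy
        double d = distance_between(i, j);
        if (d < 10) {
            // per-pair calculations here
        }
    }
}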
vladmihaisima
  • At http://stackoverflow.com/questions/2100490/floating-point-inaccuracy-examples you can find some examples of problems with floating point. When using multiple threads, problem 3 (rounding) will affect the result (the difference can be small, but this depends on the algorithm). – vladmihaisima May 07 '15 at 17:01
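
To make the rounding point concrete, here is a small self-contained example (not from the thread) showing that floating-point addition is not associative, which is why summing in a thread-dependent order can change the result:

#include <cstdio>

int main() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    // (a + b) + c: a and b cancel exactly, leaving 1
    printf("%f\n", (a + b) + c);
    // a + (b + c): c is absorbed into b by rounding, leaving 0
    printf("%f\n", a + (b + c));
    return 0;
}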
0

Can I simply put "#pragma omp parallel for" in front of the outer loop to distribute the work?

It depends on what you do in the parallel section. To gain better performance, you may need to change the algorithm of your code rather than just put #pragma omp parallel for in front of your sequential loop. Remember that the key issue in parallel programming is shared variables. In some cases, sequential code will even be faster than a parallel version.

If I do, my program runs much faster, but I am concerned that the calculations it performs may not be accurate.

Indeed, you have to make sure that there is no race condition in your code in order to get correct results.
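
One common race-free pattern is a reduction. This is only a sketch under assumptions (the pair_energy helper and the total_energy accumulator are made up; the asker's real per-agent updates would need another approach, such as the atomics shown in the next answer):

double total_energy = 0.0;
#pragma omp parallel for reduction(+:total_energy)
for (int i = 0; i < agents.size(); i++) {
    for (int j = 0; j < agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // each thread accumulates into its own private copy of
            // total_energy; OpenMP combines the copies at the end
            total_energy += pair_energy(i, j); // hypothetical helper
        }
    }
}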

0

Make sure that interactions (i, j) and (j, i), which affect the same data, don't cause data races. You can do so by carefully tuning the distribution of the work, or by using omp critical or omp atomic constructs where necessary.
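
For illustration, here is a sketch of the atomic approach (the force member and the force_between helper are assumptions, not the asker's actual code). Note that both sides of the symmetric update need protection: with the outer loop parallelized, the i of one thread can appear as the j of another.

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < agents.size(); i++) {
    for (int j = i + 1; j < agents.size(); j++) {  // visit each pair once
        if (distance_between(i, j) < 10) {
            double f = force_between(i, j);        // hypothetical helper
            #pragma omp atomic
            agents[i].force += f;  // i may be another thread's j
            #pragma omp atomic
            agents[j].force -= f;  // row j belongs to some other thread
        }
    }
}

The schedule(dynamic) clause is there because the triangular loop gives later values of i less work, so a static distribution would leave some threads idle.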

Apart from that, there is no problem at all in making your code run faster by just using an omp parallel for construct on the outer loop. Bear in mind that if your processor has a large number of processing units (cores, hardware threads, etc.), or if you are running on a ccNUMA machine, it might be a good idea to do some additional things, as the scalability of your code will otherwise not be as good as it could be.

The easiest improvement is to add collapse(2) to your omp for directive. This tells OpenMP that you want to distribute the iterations of both loops together:

#pragma omp parallel for collapse(2)
for (int i = 0; i < agents.size(); i++) {
    for (int j = 0; j < agents.size(); j++) {
        if (i == j) continue;
        if (distance_between(i, j) < 10) {
            // do a lot of calculations for the interaction between i and j,
            // and a bunch of changes to variables of the Agents stored in the list
        }
    }
}

Once you reach huge numbers of particles (or agents, as you call them), it might be wise to sort all of them according to their position. This will make the if condition inside the innermost loop behave much more predictably, allowing better branch prediction (see Why is it faster to process a sorted array?).
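
The sorting step itself could look like this sketch (the position member x is an assumption about the Agent class):

#include <algorithm>

// sort agents by position so that spatially close agents are also
// close in the iteration order
std::sort(agents.begin(), agents.end(),
          [](const Agent& a, const Agent& b) { return a.x < b.x; });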

In addition, you can remove the branch completely if you divide the geometric space into bins: you know in advance that agents in non-adjacent bins are too far apart for the condition to ever hold.
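
A minimal sketch of that binning (a 1-D position x and a domain_size parameter are assumptions for brevity; real code would bin in two or three dimensions): with bins exactly as wide as the cutoff radius, an agent's potential partners all sit in its own bin or the two adjacent ones.

#include <algorithm>
#include <cmath>
#include <vector>

struct Agent { double x; };  // assumption: 1-D position for brevity

// Put every agent index into a bin of width `cutoff`. Partners of an
// agent in bin b can only be in bins b-1, b, and b+1.
std::vector<std::vector<int>> build_bins(const std::vector<Agent>& agents,
                                         double cutoff, double domain_size) {
    int num_bins = (int)std::ceil(domain_size / cutoff);
    std::vector<std::vector<int>> bins(num_bins);
    for (int i = 0; i < (int)agents.size(); i++) {
        // clamp so an agent sitting exactly at domain_size stays in range
        int b = std::min((int)(agents[i].x / cutoff), num_bins - 1);
        bins[b].push_back(i);
    }
    return bins;
}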

This kind of problem is very well known; as you might already know, these are called n-body problems. The bucketed optimization is related to the Barnes-Hut simulation, although you may find other approaches that work much faster, such as the Fast Multipole Method, which (although more difficult to parallelize) is usually more efficient; roughly speaking, it reduces the number of computations, for instance by solving both interactions i,j and j,i in the same step.

Jorge Bellon