
I do not know why, but the parallelization of this loop is not working correctly:

#pragma omp parallel for private(d) 
for ( i = 0 ; i < nt1 ; i++ ) {
  d = distancia(tabla1[i],tabla2[j]);
  if ( d < dm ) {
    #pragma omp critical
    if ( d < dm ) {
      dm = d;
      im = i;
    }
  }
}
ps[j] = im;

The d variable is private, and dm and im are updated inside a critical section, but when I run this code (the full code is larger, this is just a small part of it) the output differs from the sequential version. Thanks!

edoelas
    Please provide a [mcve] with a clear description of expected and actual output. Note: You must not use a double-checked-lock-pattern like this. It is incorrect to even read `dm` unprotected. It is, however, unlikely that this is the only problem in your code. – Zulan Nov 13 '18 at 15:15
  • This looks like it might appear inside a loop over variable `j`. If so, then consider parallelizing the outer loop instead of the inner. Very possibly, you will that way be able to do without a critical section, which would be an enormous win. – John Bollinger Nov 13 '18 at 19:28

2 Answers


The first if ( d < dm ) is outside the critical section, so reading dm there is a data race and can cause incorrect results. It should be removed, since the same check is already made inside the critical section. If your purpose is to avoid entering the critical part too often, this is not the way to do it (see below). You may also have to manually flush dm (see the documentation of #pragma omp flush for an explanation).
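
As an illustration, a minimal sketch of the loop with the unprotected pre-check simply dropped (it assumes the surrounding j loop and the tabla1, tabla2, nt1, dm, im, distancia symbols from the question); every thread now enters the critical section, which is slow but at least correct:

/* Sketch: keep only the protected comparison (question's variables assumed) */
#pragma omp parallel for private(d)
for ( i = 0 ; i < nt1 ; i++ ) {
  d = distancia(tabla1[i], tabla2[j]);
  #pragma omp critical
  {
    if ( d < dm ) {
      dm = d;
      im = i;
    }
  }
}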

The distancia function may also not be thread-safe (e.g. it may rely on some global state that cannot safely be modified by several threads concurrently).
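
As a purely hypothetical example (not the OP's code), a distance function that keeps its result in a static variable is not safe to call from several threads at once:

/* Hypothetical example: the static variable is shared by all threads,
   so concurrent calls race on it. */
float distancia_unsafe(float a, float b)
{
    static float last;   /* shared between threads: data race */
    last = a - b;
    return last < 0 ? -last : last;
}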

Also, it is normal for the results to differ if some distance values are equal. Since there is no rule to handle that case specifically (e.g. take the smallest im), the index of the first minimum value encountered is kept: it may not be the same in the serial and parallel versions, because the iterations are not processed in the same order.

Unless the distancia function is CPU-intensive, this code is likely to run slower than the serial version, for several reasons: notably the synchronization overhead of the critical section and the cache-sharing issue on the variable dm. It is better to use a reduction approach, where each thread computes the minimum over the iterations it was given, and the global minimum is then taken as the smallest of the per-thread minima.

#include <math.h> // for INFINITY

float dm = INFINITY;
int im = -1;

#pragma omp parallel shared(dm,im)
{
  // Declare thread-private variables
  float d, dmin_thread = INFINITY;
  int i, imin_thread = -1;

  // Loop through i; the iterations are distributed among the threads
  #pragma omp for
  for ( i = 0 ; i < nt1 ; i++ ) {
    d = some_function(i);
    // the following lines operate only on thread-private variables
    if ( d < dmin_thread ) {
      dmin_thread = d;
      imin_thread = i;
    }
  } // end for

  // Reduction part after the loop has been completed:
  // the critical section is entered only once by each thread
  #pragma omp critical
  {
    if ( dmin_thread < dm ) {
      dm = dmin_thread;
      im = imin_thread;
    }
  } // end critical

} // end parallel

if ( im == -1 )
  throw_some_error();
else
  do_something_with(im);
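
Remember to compile with OpenMP support enabled (for example gcc -fopenmp with GCC); otherwise the #pragma omp directives are ignored and the code runs sequentially.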
Brice

I believe the problem was that, even with the critical section, the sequential version always keeps the lowest i value when distances are equal. By adding the condition (dm == d && i < im) to both checks, we make sure the code behaves the same way as the sequential version. I'm not sure if this is the best solution, but it's pretty simple and it worked for me. Here is the full code with the changes:

for ( j = 0 ; j < nt2 ; j++ ) {
  dm = MAX_LON + 1;
  #pragma omp parallel for private(d) 
  for ( i = 0 ; i < nt1 ; i++ ) {
    d = distancia(tabla1[i],tabla2[j]);
    if ( d < dm || (dm == d && i < im) ) {
      #pragma omp critical
      {
        if ( d < dm || (dm == d && i < im) ) {
          dm = d;
          im = i;
        }
      }
    }
  }
  ps[j] = im;
}
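
Note that, as Zulan pointed out in the comments, the outer check still reads dm and im outside the critical section, so strictly speaking this is still a data race; the extra condition only restores the sequential tie-breaking in practice.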
edoelas