
I am trying to optimize some image processing code with OpenMP. I am still learning it, and I have run into the classic case of a parallel for loop that is slower than my single-threaded implementation.

The for loop I tried to parallelize has only 50 iterations, and I have read that this could explain why OpenMP is useless in this case: the overhead of the OpenMP operations may outweigh the gain.

But is it possible to evaluate these costs? And what are the cost differences between the shared / private / firstprivate (...) clauses?

This is the for loop I want to parallelize:

#pragma omp parallel for \
      shared(img_in, img_height, img_width, w, update_gauss, MoGInitparameters, nb_motion_bloc) \
      firstprivate(img_fg, gaussStruct) \
      private(ContrastHisto, j, x, y) \
      schedule(dynamic, 1) \
      num_threads(max_threads) 
  for(i = 0; i<h; i++){ // h=50
    for(j = 0; j<w; j++){
      ComputeContrastHistogram_generic1D_rect(img_in,  H_DESC_STEP*i, W_DESC_STEP*j, ContrastHisto, H_DESC_SIZE, W_DESC_SIZE, 2, img_height, img_width);
      x = i;
      y = w - 1 - j;
      img_fg->data[y][x] = 1-MatchMoG_GaussianInt(ContrastHisto, gaussStruct->gauss[i*w+j], update_gauss, MoGInitparameters);
      #pragma omp critical
      {
        nb_motion_bloc += img_fg->data[y][x];
        nb_motion_bloc += img_fg->data[y][x];
      }
    }
  }

Maybe I am making some mistakes; if that's the case, please tell me where!

baptiste
    Yes, it is possible to evaluate the overhead of the various OpenMP constructs - see the [EPCC OpenMP micro-benchmark suite](https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite). By the way, it looks like you are performing a sum reduction into `nb_motion_bloc` and should therefore use `reduction(+:nb_motion_bloc)` instead of having the summation in a critical block. Also, `schedule(dynamic,1)` is probably very suboptimal in your case. – Hristo Iliev Feb 25 '15 at 07:58

1 Answer


First, several points that don't directly address your specific questions.

Try to avoid `#pragma omp critical` where you can. In this case, you can remove it altogether by adding a `reduction(+:nb_motion_bloc)` clause to the `omp parallel for` line and using `nb_motion_bloc += 2*img_fg->data[y][x];`, as in the sketch below.
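A rough sketch of the changed directive, keeping your variable names, with `nb_motion_bloc` dropped from the `shared` list because the reduction clause handles it:

#pragma omp parallel for \
      shared(img_in, img_height, img_width, w, update_gauss, MoGInitparameters) \
      firstprivate(img_fg, gaussStruct) \
      private(ContrastHisto, j, x, y) \
      num_threads(max_threads) \
      reduction(+:nb_motion_bloc)

and, inside the inner loop, in place of the critical section:

      nb_motion_bloc += 2*img_fg->data[y][x];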

Depending on how much work each iteration does, short loops incur (much) more overhead than they're worth; it is worth timing both versions yourself, as in the sketch below.
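A minimal, self-contained way to measure this is to time the serial and parallel loops with `omp_get_wtime()`. This sketch uses a dummy `do_work()` function purely as a hypothetical stand-in for your per-iteration work:

#include <omp.h>
#include <stdio.h>

/* Stand-in for the real per-iteration work (hypothetical). */
static double do_work(int i)
{
    double s = 0.0;
    for (int k = 0; k < 100000; ++k)
        s += (i + k) * 1e-9;
    return s;
}

int main(void)
{
    const int h = 50;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    for (int i = 0; i < h; ++i)
        sum += do_work(i);
    double t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < h; ++i)
        sum += do_work(i);
    double t_parallel = omp_get_wtime() - t0;

    printf("serial: %g s  parallel: %g s  (sum=%g)\n", t_serial, t_parallel, sum);
    return 0;
}

If the parallel time is not clearly lower, the per-iteration work is too small relative to the thread-management overhead.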

Now to the questions. If you don't change any of the variables, don't bother classifying them as shared/private/firstprivate. If they are meant to be used by each thread and then discarded, you can use a construct like

#pragma omp parallel
{
    int x, y;
    #pragma omp for
    for(i = 0; i<h; i++)
    {
        ...
    }    
}

If the workload is balanced, then consider using `schedule(static)`, as below.
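In the construct above, that just means changing the worksharing directive, for example:

    #pragma omp for schedule(static)

With a static schedule the iterations are split into chunks up front, so there is no per-chunk scheduling cost at run time, unlike `schedule(dynamic, 1)`, which hands out one iteration at a time.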

As to the differences between shared/private/firstprivate, see this and this question. From Wikipedia (a short illustration follows the list):

  • shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.
  • private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.
  • default: allows the programmer to state that the default data scoping within a parallel region will be either shared, or none for C/C++, or shared, firstprivate, private, or none for Fortran. The none option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.
  • firstprivate: like private except initialized to original value.
  • lastprivate: like private except original value is updated after construct.
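As a small illustration of `private`, `firstprivate`, and `lastprivate` (the variable names and values here are made up for the example, not taken from your code):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int a = 10, b = 10, c = 10;

    #pragma omp parallel for private(a) firstprivate(b) lastprivate(c)
    for (int i = 0; i < 4; ++i)
    {
        a = i;      /* a is uninitialized on entry, so assign before use    */
        b += i;     /* b starts at 10 in every thread (copied in)           */
        c = i;      /* the value from the last iteration is copied back out */
    }

    /* The private copies of a and b are discarded, so a and b still hold
       their pre-loop values here; c holds 3, the value from the
       sequentially last iteration. */
    printf("a=%d b=%d c=%d\n", a, b, c);
    return 0;
}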
Avi Ginsburg