
I want to parallelize this kind of loop. Note that each "calc_block" uses data obtained on the previous iteration.

for (i=0 ; i<MAX_ITER; i++){

    norma1 = calc_block1();
    norma2 = calc_block2();
    norma3 = calc_block3();
    norma4 = calc_block4();

    norma = norma1+norma2+norma3+norma4;
    ...some calc...
    if(norma<eps)break;
}

I tried this, but the speedup is quite small, ~1.2:

for (i=0 ; i<MAX_ITER; i++){
  #pragma omp parallel sections
  {
     #pragma omp section
       norma1 = calc_block1();
     #pragma omp section
       norma2 = calc_block2();
     #pragma omp section
       norma3 = calc_block3();
     #pragma omp section
       norma4 = calc_block4();
  }

  norma = norma1+norma2+norma3+norma4;
    ...some calc...
  if(norma<eps)break;
}

I think this happens because of the overhead of using sections inside the loop, but I don't know how to fix it. Thanks in advance!

Andrey M.
  • What's the value of `MAX_ITER`? What is the absolute time cost of the whole code and each block, respectively? – kangshiyin Oct 29 '13 at 17:48

1 Answer


You could reduce the overhead by moving the entire loop inside the parallel region. That way the threads in the pool that implements the team only get woken up once. It is a bit tricky and requires careful consideration of variable sharing classes:

#pragma omp parallel private(i,...) num_threads(4)
{
   for (i = 0; i < MAX_ITER; i++)
   {
      #pragma omp sections
      {
         #pragma omp section
         norma1 = calc_block1();
         #pragma omp section
         norma2 = calc_block2();
         #pragma omp section
         norma3 = calc_block3();
         #pragma omp section
         norma4 = calc_block4();
      }

      #pragma omp single
      {
         norma = norma1 + norma2 + norma3 + norma4;
         // ... some calc ..
      }

      if (norma < eps) break;
   }
}

Both the sections and the single construct have implicit barriers at their ends, hence the threads synchronise before going into the next loop iteration. The single construct reproduces the previously serial part of your program. The ... part in the private clause should list as many as possible of the variables that are relevant only to ... some calc .... The idea is to run the serial part with thread-local variables, since access to shared variables is slower with most OpenMP implementations.

Note that often the speed-up might not be linear for a completely different reason. For example, calc_blockX() (with X being 1, 2, 3 or 4) might have too low compute intensity and therefore require very high memory bandwidth. If the memory subsystem is not able to feed all 4 threads at the same time, the speed-up will be less than 4. An example of such a case - this question.

Hristo Iliev
  • I'm a bit confused why you used `flush`. From what I understand `single` has an implicit barrier which calls an implicit `flush` on exit, and all shared objects are synchronized, which should include `norma`. I have never used `flush` so it's something I want to learn about. – Z boson Oct 30 '13 at 08:09
  • @redrum, you've made a valid argument against the use of `flush`. – Hristo Iliev Oct 30 '13 at 08:28
  • According to this IBM [link](http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/index.jsp?topic=%2Fcom.ibm.xlcpp8l.doc%2Fcompiler%2Fref%2Fruompflu.htm) (and others), for an implicit flush "all shared objects are synchronized except those inaccessible with automatic storage duration." What does the last part of the sentence, "except those inaccessible with automatic storage duration", mean? That makes me a bit nervous. – Z boson Oct 30 '13 at 08:42
  • I have no idea. No such language is present in the OpenMP specification. "Automatic storage duration" usually translates to "local stack variables" and it probably means that a flush called in a function only synchronises the subset of shared stack variables from the main thread which is visible (accessible) in the function called. – Hristo Iliev Oct 30 '13 at 09:04
  • Could you do `omp single nowait` which would remove the implicit flush and then do `omp flush(norma)` which would only explicitly flush `norma` instead of all shared objects? But then maybe you would still need a barrier in the for loop which would have an implicit flush anyway? This `flush` directive is confusing. – Z boson Oct 30 '13 at 09:37
  • `flush` is only useful in practice in combination with `atomic` or other manual synchronisation primitives (i.e. not `barrier`, explicit or not). It is tricky indeed. In my experience, for example, GCC always uses pointers to access shared variables (i.e. it doesn't implement the relaxed memory model) and `flush` directives are simply translated into an `MFENCE` instruction that flushes all preceding out-of-order fetches and stores. In other implementations `flush` might act as `volatile`, instructing the compiler to generate actual memory stores where the code was register optimised. – Hristo Iliev Oct 30 '13 at 13:11