explict flush directive with OpenMP: when is it necessary and when is it helpful

Question

One OpenMP directive I have never used and don't know when to use is flush(with and without a list).

I have two questions:

1.) When is an explicit `omp flush` or `omp flush(var1, ...) necessary?  
2.) Is it sometimes not necessary but helpful (i.e. can it make the code fast)?

The main reason I can't understand when to use an explicit flush is that flushes are done implicitly after many directives (e.g. as barrier, single, ...) which synchronize the threads. I can't, for example, see way using flush and not synchronizing (e.g. with nowait) would be helpful.

I understand that different compilers may implement omp flush in different ways. Some may interpret a flush with a list as as one without (i.e. flush all shared objects) OpenMP flush vs flush(list). But I only care about what the specification requires. In other words, I want to know where an explicit flush in principle may be necessary or helpful.

Edit: I think I need to clarify my second question. Let me give an example. I would like to know if there are cases where removing an implicit flush (e.g. with nowait) and instead using an explicit flush instead but only on certain shared variables would be faster (and still give the correct result). Something like the following:

float a,b;
#pragma omp parallel
{
    #pragma omp for nowait // No barrier.  Do not flush on exit.
        //code which uses only shared variable a        
    #pragma  omp flush(a) // Flush only variable a rather than all shared variables.       
    #pragma omp for
       //Code which uses both shared variables a and b.
}

I think that code still needs a barrier after the the first for loop but all barriers have an implicit flush so that defeats the purpose. Is it possible to have a barrier which does not do a flush?

Michael Klemm · Accepted Answer · 2019-04-27T13:30:04.717

The flush directive tells the OpenMP compiler to generate code to make the thread's private view on the shared memory consistent again. OpenMP usually handles this pretty well and does the right thing for typical programs. Hence, there's no need for flush.

However, there are cases where the OpenMP compiler needs some help. One of these cases is when you try to implement your own spin lock. In these cases, you would need a combination of flushes to make things work, since otherwise the spin variables will not be updated. Getting the sequence of flushes correct will be tough and will be very, very error prone.

The general recommendation is that flushes should not be used. If at all, programmers should avoid flush with a list (flush(var,...)) at all means. Some folks are actually talking about deprecating it in future OpenMP.

Performance-wise the impact of flush should be more negative than positive. Since it causes the compiler to generate memory fences and additional load/store operations, I would expect it to slow down things.

EDIT: For your second question, the answer is no. OpenMP makes sure that each thread has a consistent view on the shared memory when it needs to. If threads do not synchronize, they do not need to update their view on the shared memory, because they do not see any "interesting" change there. That means that any read a thread makes does not read any data that has been changed by some other thread. If that would be the case, then you'd have a race condition and a potential bug in your program. To avoid the race, you need to synchronize (which then implies a flush to make each participating thread's view consistent again). A similar argument applies to barriers. You use barriers to start a new epoch in the computation of a parallel region. Since you're keeping the threads in lock-step, you will very likely also have some shared state between the threads that has been computed in the previous epoch.

BTW, OpenMP may keep private data for a thread, but it does not have to. So, it is likely that the OpenMP compiler will keep variables in registers for a while, which causes them to be out of sync with the shared memory. However, updates to array elements are typically reflected pretty soon in the shared memory, since the amount of private storage for a thread is usually small (register sets, caches, scratch memory, etc.). OpenMP only gives you some weak restrictions on what you can expect. An actual OpenMP implementation (or the hardware) may be as strict as it wishes to be (e.g., write back any change immediately and to flushes all the time).

Do you have some example code (e.g. a link) of this spin lock with OpenMP? Or for that matter any example code which shows an explicit flush being necessary? — Z boson, Oct 31 '13 at 08:45
Frankly, I do not want to promote the existence of flush, since I truly believe that one should not use it. — Michael Klemm, Nov 01 '13 at 13:36
Okay, thank you for your answer. I think if I implemented a toy OpenMP API with pthreads this would make more sense. I have been meaning to do that... — Z boson, Nov 01 '13 at 14:24
If you are working to implement your own OpenMP implementation, you can at first start with a strict implementation that keeps the view consistent for most of the time. Weakening the memory model afterwards can be seen as an optimization. You can pretty much rely on the optimization capabilities of the base language compiler, with some memory fences added at the right place. — Michael Klemm, Nov 01 '13 at 15:42
@Michael How do you deal with for instance a scheduler that does work for the main thread and reports its status? I don't see how to push the status update to the other thread without using a flush. — Chiel, Oct 01 '14 at 15:29
If the master thread and another worker thread need to communicate, then there is a shared variable or memory area involved. To avoid race conditions, you should either use the critical construct, OpenMP locks, or the atomic construct to protect that shared memory location. In these cases, the flush is implied, for atomic you might need the cst_seq clause to ensure proper read/write fences. — Michael Klemm, Oct 02 '14 at 16:43

Daniel Langr · Answer 2 · 2017-02-23T11:46:20.530

Not exactly an answer, but Michael Klemm's question is closed for comments. I think an excellent example of why flushes are so hard to understand and use properly is the following one copied (and shortened a bit) from the OpenMP Examples:

//http://www.openmp.org/wp-content/uploads/openmp-examples-4.0.2.pdf
//Example mem_model.2c, from Chapter 2 (The OpenMP Memory Model)
int main() {
   int data, flag = 0;
   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num()==0) {
         /* Write to the data buffer that will be read by thread */
         data = 42;
         /* Flush data to thread 1 and strictly order the write to data
            relative to the write to the flag */
         #pragma omp flush(flag, data)
         /* Set flag to release thread 1 */
         flag = 1;
         /* Flush flag to ensure that thread 1 sees S-21 the change */
         #pragma omp flush(flag)
      }
      else if (omp_get_thread_num()==1) {
         /* Loop until we see the update to the flag */
         #pragma omp flush(flag, data)
         while (flag < 1) {
            #pragma omp flush(flag, data)
         }
         /* Values of flag and data are undefined */
         printf("flag=%d data=%d\n", flag, data);
         #pragma omp flush(flag, data)
         /* Values data will be 42, value of flag still undefined */
         printf("flag=%d data=%d\n", flag, data);
      }
   }
   return 0;
}

Read the comments and try to understand.

Since you just copied and pasted the code from the manual I added the name of the code and location — Z boson, Feb 23 '17 at 10:24

explict flush directive with OpenMP: when is it necessary and when is it helpful

2 Answers2

Linked