
If I have the following case:

bool cond_var;

#pragma omp parallel shared(cond_var)
{
    bool some_private_var;
    // ...
    do {
       #pragma omp single
       {
           cond_var = true;
       }

       // do something, calculate some_private_var;
       // ...

       #pragma omp atomic update
       cond_var &= some_private_var;

       // Syncing step
       // (???)

    } while(cond_var);

    // ... (other parallel stuff)
}

I want my do-while loop to run the same number of iterations in every thread, but when I tried #pragma omp barrier as the syncing step (just before the end of the loop), I ended up with a deadlock. Printing the value of cond_var showed that some threads saw it as true while others saw it as false, so the loop finished for some threads and left the others stuck on the barrier. I then tried various combinations and orderings of barrier and flush, with no luck (with some combinations, the deadlock was merely postponed).

How do I properly combine and sync the loop condition among the threads so that the loop runs the same number of iterations in every thread?

UPDATE

I have also tried loading the value of cond_var into another private variable with #pragma omp atomic read, and testing that copy as the loop condition. It didn't work either. Apparently, atomic read guarantees that I get a consistent value (either old or new), but doesn't guarantee it is the latest one.

UPDATE 2

Based on Jonathan Dursi's code, this is an MCVE that looks more like what I am trying to do:

#include <omp.h>
#include <cstdio>
#include <random>
#include <chrono>
#include <thread>

int main() {

    bool cond_var;
    const int nthreads = omp_get_max_threads();

    #pragma omp parallel default(none) shared(cond_var)
    {
        bool some_private_var;
        std::random_device rd;
        std::mt19937 rng(rd());
        unsigned iter_count = 0;

        /* chance of having to end: 1 in 6**nthreads; all threads must choose 0 */
        std::uniform_int_distribution<int> dice(0,5);

        const int tid = omp_get_thread_num();
        printf("Thread %d started.\n", tid);
        do {
            ++iter_count;

            #pragma omp once shared(cond_var)
            {
                // cond_var must be reset to 'true' because it is the
                // neutral element of &
                // For the loop to end, all threads must choose the
                // same random value 0
                cond_var = true;
            }

            some_private_var = (dice(rng) == 0);

            // If all threads choose 0, cond_var will remain 'true', ending the loop
            #pragma omp atomic update
            cond_var &= some_private_var;

            #pragma omp barrier
        } while(!cond_var);
        printf("Thread %d finished with %u iterations.\n", tid, iter_count);
    }

    return 0;
}

Running with 8 threads on a machine with enough logical cores to run all of them simultaneously, most runs deadlock in the first iteration, although one run finished correctly on the second iteration (which does not match the odds of 1 in 1679616 (6**8) of all threads choosing 0).

lvella
  • Move `#pragma omp parallel` inside the do loop. OpenMP implementations create a thread pool so there is little cost between each iteration of the do while loop. – Z boson Oct 10 '14 at 18:51
  • And how to keep my private variables created before the loop? The reason I am doing this way is that inside the loop there is another parallel loop (nested parallel) that works on the private variables of each thread organized previously. – lvella Oct 10 '14 at 19:29
  • One nice feature of @Zboson 's suggestion is that you could do an "&" reduction on cond_var. But I'm not seeing why what you're doing isn't working - either flush or barrier (which has an implied flush) should work. Can you post an [MCVE](http://stackoverflow.com/help/mcve) which demonstrates the issue? – Jonathan Dursi Oct 10 '14 at 20:17
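
For reference, a minimal sketch of what that suggestion might look like: the parallel region is re-entered on every iteration and the per-thread results are combined with a reduction clause, so no explicit barrier or atomic update is needed. The dice-based exit condition and the variable names are only illustrative (borrowed from the MCVE above), not code from the comments:

#include <omp.h>
#include <cstdio>
#include <random>

int main() {
    bool cond_var;
    unsigned iter_count = 0;

    do {
        ++iter_count;
        cond_var = true;   // neutral element of &&; the reduction folds every thread's result into it

        #pragma omp parallel reduction(&&:cond_var)
        {
            // Per-thread state lives only for this iteration, which is the
            // trade-off raised in the comment above.
            std::random_device rd;
            std::mt19937 rng(rd());
            std::uniform_int_distribution<int> dice(0, 1);

            bool some_private_var = (dice(rng) == 0);
            cond_var = cond_var && some_private_var;
        }
        // At this point cond_var is true only if every thread chose 0.
    } while (!cond_var);

    printf("Finished after %u iterations with %d threads.\n",
           iter_count, omp_get_max_threads());
    return 0;
}

As the comment above points out, per-thread state created before the loop does not persist across iterations with this structure, which may or may not be acceptable depending on how much private data each thread carries.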

2 Answers


The issue is that in the while loop you are updating cond_var in two places and reading it in a third, and you need to ensure that these operations don't interfere with each other. Each loop iteration, the code:

  1. Sets cond_var = true (using a non-existent OpenMP pragma, "once", which is ignored, so the assignment is done by every thread);
  2. Updates cond_var by &ing it with the local condition variable;
  3. Uses the updated-by-everyone cond_var to test whether to exit out of the loop.

Thus, one needs to make sure that no thread is setting cond_var to true (1) while other threads are still anding it (2); that no thread is still anding it (2) while others are testing it to exit the loop (3); and that no thread is testing it (3) while another is setting it back to true (1).

The obvious way to do that is with barriers, one between each of those three steps, so three barriers in total. This works:

#include <omp.h>
#include <cstdio>    // for printf
#include <random>
#include <chrono>
#include <thread>
#include <iostream>

int main() {

    bool cond_var;

    #pragma omp parallel default(none) shared(cond_var,std::cout)
    {
        bool some_private_var;
        std::random_device rd;
        std::mt19937 rng(rd());
        unsigned iter_count = 0;

        std::uniform_int_distribution<int> dice(0,1);

        const int tid = omp_get_thread_num();
        printf("Thread %d started.\n", tid);
        do {
            ++iter_count;

            #pragma omp barrier
            #pragma omp single 
            cond_var = true;
            // implicit barrier here at the end of the single; a nowait clause would remove it, which must not be done here.

            some_private_var = (dice(rng) == 0);

            // If all threads choose 0, cond_var will remain 'true', ending the loop
            #pragma omp atomic update
            cond_var &= some_private_var;

            #pragma omp barrier
        } while(!cond_var);

        #pragma omp critical
        std::cout << "Thread " << tid << " finished with " << iter_count << " iterations." << std::endl;
    }

    return 0;
}

You can do a little better by having every thread set only its own slot in a shared array, and having a single thread do the and-ing; you still need two barriers, one to make sure everyone is done before the and-ing, and one to make sure the and-ing is done before testing for completion:

#include <omp.h>
#include <cstdio>    // for printf
#include <random>
#include <chrono>
#include <thread>
#include <iostream>

int main() {

    bool cond_var;

    const int num_threads = omp_get_max_threads();
    const unsigned int spacing=64/sizeof(bool);  /* to avoid false sharing */
    bool local_cond_var[num_threads*spacing];

    #pragma omp parallel default(none) shared(cond_var,std::cout,local_cond_var)
    {
        std::random_device rd;
        std::mt19937 rng(rd());
        unsigned iter_count = 0;

        std::uniform_int_distribution<int> dice(0,1);

        const int tid = omp_get_thread_num();
        printf("Thread %d started.\n", tid);
        do {
            ++iter_count;

            local_cond_var[tid*spacing] = (dice(rng) == 0);

            #pragma omp barrier
            #pragma omp single
            {
                cond_var = true;
                for (int i=0; i<num_threads; i++)
                    cond_var &= local_cond_var[i*spacing];
            }
            // implicit barrier here at the end of the single; a nowait clause would remove it, which must not be done here.
        } while(!cond_var);

        #pragma omp critical
        std::cout << "Thread " << tid << " finished with " << iter_count << " iterations." << std::endl;
    }

    return 0;
}

Note that the barriers, explicit or implicit, imply a flush of the shared variables; adding a nowait clause to the single constructs (and thus removing their implicit barriers) causes intermittent deadlocks.

Jonathan Dursi
  • Hi, I just added an MCVE. Your code seems to work properly, but mine deadlocks (the exit condition depends on all threads, not only one). – lvella Oct 13 '14 at 19:07

Putting a #pragma omp barrier after the last statement in the body of the loop did not cause a deadlock for me, but it's not sufficient, either. Although the worker threads will wait at the barrier until they can all pass through together, that does not ensure that they have a consistent view of cond_var on the other side. If on any iteration the first thread(s) to update cond_var leave it true, then some or all of those threads may perform another iteration even though another thread later sets it false. Only when those threads return to the atomic update are they certain to see the value written by the other threads.

You should be able to work around that issue by performing an atomic read of the condition variable into a private variable after the barrier, and testing that copy as the loop condition. You need to do that, or something else that solves the problem, because it violates an OpenMP constraint for different threads of the same team to reach the barrier a different number of times. In fact, that is most likely why your program hangs: the threads that perform an extra iteration are stuck waiting for the others at the barrier.
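
As a rough illustration (this is not the poster's code), here is how that workaround could be applied to the MCVE from the question. Note that it also keeps a barrier before the reset of cond_var, as in the other answer, so the reset cannot race with a thread that is still taking its snapshot; the variable name snapshot is purely illustrative:

#include <omp.h>
#include <cstdio>
#include <random>

int main() {
    bool cond_var;

    #pragma omp parallel default(none) shared(cond_var)
    {
        std::random_device rd;
        std::mt19937 rng(rd());
        std::uniform_int_distribution<int> dice(0, 1);
        unsigned iter_count = 0;
        bool snapshot;                  // private copy of cond_var, used as the loop condition

        do {
            ++iter_count;

            #pragma omp barrier         // nobody may reset cond_var while others still read it
            #pragma omp single
            cond_var = true;
            // implicit barrier at the end of the single

            bool some_private_var = (dice(rng) == 0);

            #pragma omp atomic update
            cond_var &= some_private_var;

            #pragma omp barrier         // every thread has finished its update

            #pragma omp atomic read
            snapshot = cond_var;        // private snapshot taken after the barrier
        } while (!snapshot);

        printf("Thread %d finished with %u iterations.\n",
               omp_get_thread_num(), iter_count);
    }
    return 0;
}

Whether the atomic read is strictly necessary after the barrier is exactly what the comments below debate; the sketch only shows where the suggested read would go.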

John Bollinger
  • Yes, that is precisely the point: because they have inconsistent views of `cond_var`, some threads will never reach the barrier again, deadlocking. But I did try to load the variable with `#pragma omp atomic read`, and it didn't work. Apparently, atomic read guarantees that I will not get an inconsistent value, but does not guarantee it is the most recent one. – lvella Oct 10 '14 at 19:22
  • Sorry, what's the argument that barrier isn't sufficient? There's an implied flush at the end of every "task scheduling point", which includes barriers. – Jonathan Dursi Oct 10 '14 at 20:12
  • There's an implied flush to memory, but not a corresponding read *from* memory. After passing the barrier, threads that reached the barrier earlier may not see updates that were performed by threads that reached it later. – John Bollinger Oct 10 '14 at 20:27
  • @lvella: as Jonathan Dursi said, there is an implied flush at the barrier. Any write performed by one thread prior to reaching the barrier, and not subsequently overwritten, must be visible to all threads reading the same memory after passing the barrier. – John Bollinger Oct 10 '14 at 20:33
  • Howevever, depending on how you code the subsequent read, it might not have the effect you think. Try switching your loop condition to use the internal `some_private_var`, which you read back atomically from the shared `cond_var` between the barrier and the bottom of the loop. – John Bollinger Oct 10 '14 at 20:36
  • Clarification: after passing the barrier, threads that reached the barrier earlier may have a temporary view of memory that does not reflect updates that were performed by threads that reached it later. – John Bollinger Oct 10 '14 at 20:39
  • I suspect that a big part of the problem is the certainly unnecessary, and likely confounding, `cond_var = true` line at the beginning of the loop. It seems like the ensuing race condition has to be a problem - but without a complete replicator it's hard to know. – Jonathan Dursi Oct 10 '14 at 20:53
  • @JonathanDursi yes, that's weird and unnecessary. The variable could instead be initialized once, in its declaration. But that isn't an essential aspect of the problem, as I was able to demonstrate for myself by experiment. There is in any event no race, however, because the assignment is in a block marked `omp single`, which provides an implicit barrier at its end. – John Bollinger Oct 10 '14 at 21:01