I agree that the barrier is necessary. I see several ways out, in order of increasing complexity and likely increasing efficiency:
Tasks
Post a task for each result element:
#pragma omp parallel
{
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    // would prefer a range based loop here, but
    // there seem to be issues with passing references
    // to tasks in certain compilers
    for (size_t i = 0; i < result.size(); i++) {
        // shared(result) avoids copying the whole vector into each
        // task; the taskwait below keeps result alive until the
        // tasks are done
        #pragma omp task shared(result)
        foo(result[i]);
    }
    #pragma omp taskwait
}
You could even post the tasks from within the initial loop. If there are too many tasks, the scheduling overhead might become significant.
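For instance, posting each task directly from the first loop could look roughly like this (untested sketch; computeElement is a hypothetical stand-in for whatever produces one result element):

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        T element = computeElement(i); // hypothetical producer
        // element is private here, so the task gets its own
        // firstprivate copy - no lifetime issues
        #pragma omp task firstprivate(element)
        foo(element);
    }
}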
Processing the result queue with finished threads
Now this one is trickier - in particular, you need to distinguish between the result queue being empty and all threads having completed their first loop.
#include <chrono>
#include <thread>
#include <vector>
#include <omp.h>

using namespace std::chrono_literals; // for 10ms

std::vector<T> sharedResult;
int threadsBusy;
size_t resultIndex = 0;
#pragma omp parallel
{
    #pragma omp single
    threadsBusy = omp_get_num_threads();
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    #pragma omp critical
    {
        sharedResult.insert(sharedResult.end(), result.begin(), result.end());
        threadsBusy--;
    }
    do {
        bool hasResult, allThreadsDone;
        // We need a copy here, as the vector may be resized
        // and elements may be invalidated by insertion
        T myResult;
        #pragma omp critical
        {
            if (resultIndex < sharedResult.size()) {
                myResult = sharedResult[resultIndex];
                resultIndex++;
                hasResult = true;
            } else {
                hasResult = false;
            }
            allThreadsDone = (threadsBusy == 0);
        }
        if (hasResult) {
            foo(myResult);
        } else {
            if (allThreadsDone) {
                break;
            }
            // If we just continue here, we will spin on the mutex.
            // Unfortunately there are no condition variables in OpenMP,
            // so instead we go for a quick nap as a compromise.
            // Feel free to tune this accordingly.
            std::this_thread::sleep_for(10ms);
        }
    } while (true);
}
Note: Usually I test the code I post here, but I couldn't due to the lack of a complete example.
Process results in chunks via parallel loops
Finally, you could run parallel for loops multiple times over those results that are already done. However, this has a number of issues. First, each worksharing region must be encountered by all threads, even by the ones that complete the first loop late, so you would have to keep track of the loops you run. Also, the loop bound needs to be the same for each thread, and you must only read sharedResult.size() in a critical section. So one thread has to read it into a shared variable in a critical section beforehand, and all threads have to wait until it has been read. Further, you would have to use dynamic scheduling; with static scheduling you would end up waiting on the threads that complete last anyway. Your edited example does neither of these things. I also wouldn't take it for granted that a for nowait schedule(dynamic) can complete before all threads in the team have entered it (though it works with libgomp). All things considered, I wouldn't really go there.
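To make that concrete, here is roughly what one round of the chunked variant would have to look like (untested sketch; processedUpTo and chunkEnd are my own names - both shared, with processedUpTo initialized to 0 before the parallel region). Note that the implicit barriers of the single constructs reintroduce exactly the kind of waiting this approach is meant to avoid:

#pragma omp single
{
    #pragma omp critical
    chunkEnd = sharedResult.size(); // snapshot under the producers' lock
}
// the single's implicit barrier makes the snapshot visible to all threads
#pragma omp for schedule(dynamic)
for (size_t i = processedUpTo; i < chunkEnd; i++) {
    foo(sharedResult[i]);
}
#pragma omp single
processedUpTo = chunkEnd; // another implicit barrier before the next round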