I agree that the barrier is necessary. I see several ways out, in order of increasing complexity and likely increasing efficiency:
Tasks
Post a task for each result element:
#pragma omp parallel
{
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    // would prefer a range based loop here, but
    // there seem to be issues with passing references
    // to tasks in certain compilers
    for (size_t i = 0; i < result.size(); i++) {
        // shared(result) avoids copying the whole vector into each
        // task; the taskwait below keeps result alive until the
        // tasks are done
        #pragma omp task shared(result)
        foo(result[i]);
    }
    #pragma omp taskwait
}
You could even post the tasks from within the initial loop. If there are too many tasks, the scheduling overhead might become significant.
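For instance, posting each task directly from the first loop could look roughly like this (untested sketch; computeElement is a hypothetical stand-in for whatever produces one result element):

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        T element = computeElement(i); // hypothetical producer
        // element is private here, so the task gets its own
        // firstprivate copy - no lifetime issues
        #pragma omp task firstprivate(element)
        foo(element);
    }
}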
Processing the result queue with finished threads
Now this one is trickier - in particular, you need to distinguish between the result queue being empty and all threads having completed their first loop.
#include <chrono>
#include <thread>
#include <vector>
#include <omp.h>

using namespace std::chrono_literals; // for 10ms

std::vector<T> sharedResult;
int threadsBusy;
size_t resultIndex = 0;
#pragma omp parallel
{
    #pragma omp single
    threadsBusy = omp_get_num_threads();
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    #pragma omp critical
    {
        sharedResult.insert(sharedResult.end(), result.begin(), result.end());
        threadsBusy--;
    }
    do {
        bool hasResult, allThreadsDone;
        // We need a copy here, as the vector may be resized
        // and elements may be invalidated by insertion
        T myResult;
        #pragma omp critical
        {
            if (resultIndex < sharedResult.size()) {
                myResult = sharedResult[resultIndex];
                resultIndex++;
                hasResult = true;
            } else {
                hasResult = false;
            }
            allThreadsDone = (threadsBusy == 0);
        }
        if (hasResult) {
            foo(myResult);
        } else {
            if (allThreadsDone) {
                break;
            }
            // If we just continue here, we will spin on the mutex.
            // Unfortunately there are no condition variables in OpenMP,
            // so instead we go for a quick nap as a compromise.
            // Feel free to tune this accordingly.
            std::this_thread::sleep_for(10ms);
        }
    } while (true);
}
Note: Usually I test the code I post here, but I couldn't due to the lack of a complete example.
Process results in chunks via parallel loops
Finally, you could run parallel for loops multiple times over those results that are already done. However, this has a number of issues. First, each worksharing region must be encountered by all threads, even by the ones that complete the first loop late, so you would have to keep track of the loops you run. Also, the loop bound needs to be the same for each thread, and you must only read sharedResult.size() in a critical section. So one thread has to read it into a shared variable in a critical section beforehand, and all threads have to wait until it has been read. Further, you would have to use dynamic scheduling; with static scheduling you would end up waiting on the threads that complete last anyway. Your edited example does neither of these things. I also wouldn't take it for granted that a for nowait schedule(dynamic) can complete before all threads in the team have entered it (though it works with libgomp). All things considered, I wouldn't really go there.
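To make that concrete, here is roughly what one round of the chunked variant would have to look like (untested sketch; processedUpTo and chunkEnd are my own names - both shared, with processedUpTo initialized to 0 before the parallel region). Note that the implicit barriers of the single constructs reintroduce exactly the kind of waiting this approach is meant to avoid:

#pragma omp single
{
    #pragma omp critical
    chunkEnd = sharedResult.size(); // snapshot under the producers' lock
}
// the single's implicit barrier makes the snapshot visible to all threads
#pragma omp for schedule(dynamic)
for (size_t i = processedUpTo; i < chunkEnd; i++) {
    foo(sharedResult[i]);
}
#pragma omp single
processedUpTo = chunkEnd; // another implicit barrier before the next round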