Does OpenMP with target offloading on the GPU include a global memory fence / global barrier, similar to OpenCL?
barrier(CLK_GLOBAL_MEM_FENCE);
I've tried using a barrier inside a teams construct:
#pragma omp target teams
{
    // Some initialization...
    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some work...
    }
    #pragma omp barrier
    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some other work depending on the previous loop
    }
}
However, it seems that the barrier only works within a team, equivalent to:
barrier(CLK_LOCAL_MEM_FENCE);
I would like to avoid splitting the kernel into two, since that would mean writing team-local data to global memory just to load it again.
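For reference, the split version I am trying to avoid would look roughly like the sketch below. The function name `two_phase` and the arrays `in`, `tmp` and `out` are made up for illustration and are not from my real code. The end of the first target region is the point where all teams are known to have finished, and the enclosing `target data` region keeps the arrays on the device between the two kernels, but `tmp` still makes a round trip through global memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-phase computation (names are illustrative only).
   Phase 2 reads a neighbour's phase-1 result, so it may only start
   once every team has finished phase 1. */
void two_phase(const float *in, float *tmp, float *out, size_t n)
{
    assert(in && tmp && out);

    /* Keep all three arrays resident on the device across both kernels,
       so nothing is copied back to the host in between. */
    #pragma omp target data map(to: in[0:n]) map(alloc: tmp[0:n]) map(from: out[0:n])
    {
        /* Kernel 1: some work. */
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i += 1)
            tmp[i] = 2.0f * in[i];

        /* The first target region has completed here, which acts as the
           synchronization point across all teams. */

        /* Kernel 2: some other work depending on the previous loop. */
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i += 1)
            out[i] = tmp[i] + tmp[(i + 1) % n];
    }
}
```

Calling `two_phase(in, tmp, out, N)` gives the correct result, but the intermediate values in `tmp` are stored to and reloaded from global memory, which is exactly the cost I would like to avoid.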
Edit: I've been able to enforce the desired behavior using a global atomic counter and busy-waiting in the teams. However, this doesn't seem like a good solution, and I'm still wondering whether there is a better way to do this using proper OpenMP.