
Does OpenMP with target offloading on the GPU include a global memory fence / global barrier, similar to OpenCL?

barrier(CLK_GLOBAL_MEM_FENCE);

I've tried using a barrier inside a teams construct:

#pragma omp target teams
{
    // Some initialization...

    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some work...
    }

    #pragma omp barrier

    #pragma omp distribute parallel for
    for (size_t i = 0; i < N; i += 1)
    {
        // Some other work depending on the previous loop
    }
}

However, it seems that the barrier only works within a team, equivalent to:

barrier(CLK_LOCAL_MEM_FENCE);

I would like to avoid splitting the kernel into two, so that I don't have to write team-local data out to global memory just to load it again.

Edit: I've been able to enforce the desired behavior using a global atomic counter and busy-waiting in the teams. However, this doesn't seem like a good solution, and I'm still wondering if there is a better way to do this using proper OpenMP.

1 Answer


A barrier construct only synchronizes threads in the current team. Synchronization between threads from different teams launched by a teams construct is not available. OpenMP's execution model doesn't even guarantee that such threads execute concurrently, so using atomic constructs to synchronize between them will not work in general. The specification says:

Whether the initial threads concurrently execute the teams region is unspecified, and a program that relies on their concurrent execution for the purposes of synchronization may deadlock.

Note that the OpenCL barrier call only provides synchronization within a workgroup, even with the CLK_GLOBAL_MEM_FENCE argument. See Barriers in OpenCL for more information on semantics of CLK_GLOBAL_MEM_FENCE versus CLK_LOCAL_MEM_FENCE.

  • This is really interesting. I've seen multiple implementations using barrier(CLK_GLOBAL_MEM_FENCE) as a global barrier, but it looks like you're right. Do you know of any way to get the behavior of a global barrier in OpenMP, besides splitting the code into multiple kernels? – Mathias Gammelmark Sep 26 '22 at 10:24