This question is similar to "GLSL memoryBarrierShared() usefulness?".

However, I wonder when we have to use subgroupMemoryBarrier and similar functions, since subgroupBarrier performs both an execution barrier and a memory barrier. For the memoryBarrier function I understand, because the barrier function does not perform a memory barrier, so you must use both:

memoryBarrier(); // memoryBarrierShared, Buffer, Image...
barrier();

But I do not know when I would use subgroupMemoryBarrier, because subgroupBarrier already does the same job.

GL_KHR_shader_subgroup extension

The function subgroupBarrier() enforces that all active invocations within a subgroup must execute this function before any are allowed to continue their execution, and the results of any memory stores performed using coherent variables performed prior to the call will be visible to any future coherent access to the same memory performed by any other shader invocation within the same subgroup.

I doubt these functions would have been added if they were not useful, so when do we need to use them?

Is it because invocations in a subgroup are assumed to run in parallel, so you can just issue a subgroupMemoryBarrier? But in that case, when do you have to use subgroupBarrier?

Antoine Morrier

1 Answer

There are two very different behaviours here, memoryBarrier() and barrier(). They both have "barrier" in the name, but they have totally different effects.

Memory barriers are designed to ensure some relative ordering of memory accesses within the scope of a single thread of execution (e.g. a single compute work item). Memory accesses from before the barrier must have completed before any accesses after the barrier are allowed to take place. In traditional CPU code this is useful for things like locks, e.g. making sure the lock is successfully taken and written to memory before you touch the structure it protects. The execution of the threads inside the subgroup relative to each other is not affected, so you can run things in parallel without draining the pipeline, and one thread in the subgroup can run code from before the memory barrier while another is running code from after it.
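As a minimal sketch of the memory-only case, the shader below writes some data and then bumps an atomic counter to announce it. The buffer names (`Payload`, `Counter`, `readyCount`), the bindings, and the workgroup size are hypothetical and chosen only for illustration; it also assumes the host allocates `payload` with at least one element per invocation.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require

layout(local_size_x = 64) in;

// Hypothetical buffers, named for illustration only.
layout(std430, binding = 0) coherent buffer Payload { uint payload[]; };
layout(std430, binding = 1) coherent buffer Counter { uint readyCount; };

void main() {
    uint i = gl_GlobalInvocationID.x;

    // 1. Write the data (assumes payload[] is large enough on the host side).
    payload[i] = i * 2u;

    // 2. Memory barrier only: order the store above before the atomic below,
    //    as observed by other invocations in the same subgroup.
    //    No invocation waits for any other invocation here.
    subgroupMemoryBarrierBuffer();

    // 3. Announce that one more element has been written.
    atomicAdd(readyCount, 1u);
}
```

No invocation is stalled waiting for its neighbours, which is the point of using a memory barrier without an execution barrier. If the ordering has to be observed by invocations outside the subgroup, `memoryBarrierBuffer()` would be the wider-scope choice.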

Full barriers are designed to realign execution across the subgroup. No thread in the subgroup can run any code from after the barrier until all threads have reached the barrier, which implicitly means that they also provide memory barrier semantics. This is what you need when you want to rely on lockless algorithms where one thread needs to make assumptions about where another thread in the subgroup has reached. For example, waiting for the thread with local invocation 0 to populate local memory.
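A sketch of that "one thread populates local memory, the rest wait" pattern at subgroup scope might look like the following. The `Output` buffer, the `slot` array, and the value written are hypothetical, and the host is assumed to size `result` with one element per invocation.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require

layout(local_size_x = 64) in;

// Hypothetical output buffer, named for illustration only.
layout(std430, binding = 0) buffer Output { uint result[]; };

// One slot per subgroup. 64 is a safe upper bound here because a
// workgroup of 64 invocations cannot contain more than 64 subgroups.
shared uint slot[64];

void main() {
    // Exactly one active invocation per subgroup populates the slot.
    if (subgroupElect()) {
        slot[gl_SubgroupID] = gl_WorkGroupID.x * 100u + gl_SubgroupID;
    }

    // Execution + memory barrier: nobody in the subgroup continues until
    // everyone has arrived, so the elected invocation's store has happened.
    subgroupBarrier();

    // Every invocation in the subgroup can now rely on the slot contents.
    result[gl_GlobalInvocationID.x] = slot[gl_SubgroupID];
}
```

Here a memory barrier alone would not be enough, because the readers need to know that the elected invocation has actually reached and executed its store, not just that its stores are ordered.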

solidpixel
  • I do understand the need for both execution and memory barriers. What I do not understand is why a memory-only barrier is needed within a subgroup, since the subgroup barrier already performs both an execution and a memory barrier. In a block or group (not a subgroup), both the `memoryBarrier` and `barrier` functions are needed because `barrier` does not perform a memory barrier, only an execution barrier. In a subgroup, however, the `subgroupBarrier` function performs both an execution and a memory barrier. – Antoine Morrier Jan 07 '19 at 14:05
  • Thus, I do not know when I would use a `subgroupMemoryBarrier`, since it does not make sense to me to perform a memory barrier without knowing whether other threads have finished their operations. To me, both barriers should be issued, and to achieve higher performance, `subgroupBarrier` should perform only an execution barrier and no memory barrier, like the `barrier` function. – Antoine Morrier Jan 07 '19 at 14:07
  • As per my example, if you have any code which depends on the ordering of two memory accesses within a single thread then you need a memory barrier (e.g. if you touch some memory and then increment an atomic to indicate that the modification has been made). If other threads only care about the fact that the update has been made, and neither care which thread made it nor need tight temporal synchronization, then a full `barrier()` would be worse for performance for no benefit. – solidpixel Jan 08 '19 at 08:50
  • In terms of "what should a barrier do", as per the answer to your earlier question here: https://stackoverflow.com/questions/39393560/glsl-memorybarriershared-usefulness note that OpenGL and OpenGL ES have not been consistent here. It's messy because of history rather than any specific purpose ... – solidpixel Jan 08 '19 at 08:54
  • You said `if you have any code which depends on ordering of two memory accesses within a single thread then you need a memory barrier (e.g. if you touch some memory and then increment an atomic to indicate that the modification has been made)`. Isn't this kind of thing awful on a GPU architecture? It is a kind of "mutex", no? And since invocations can run in parallel, we can get a deadlock, as in: [Cuda Mutex DeadLock](https://stackoverflow.com/questions/31194291/cuda-mutex-why-deadlock) – Antoine Morrier Jan 08 '19 at 09:37
  • Yes, locks specifically are horrible on GPUs, but they are a good example of the kind of use case where memory ordering is a problem and thread ordering isn't. – solidpixel Jan 08 '19 at 16:22
  • Okay, I am beginning to understand the idea of _having only a memory barrier_ :) – Antoine Morrier Jan 08 '19 at 16:24