
In the CUDA Programming Guide in the section about Cooperative Groups, there is an example of grid-local synchronization:

grid_group grid = this_grid();
grid.sync();

Unfortunately, I couldn't find a precise definition of grid.sync()'s behavior. Is it correct to take the following definition, given for __syncthreads(), and extend it to the grid level?

void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

So, my question: is the following correct?

this_grid().sync();

waits until all threads in the grid have reached this point and all global and shared memory accesses made by these threads prior to this_grid().sync() are visible to all threads in the grid.

I doubt the correctness of this because, in the CUDA Programming Guide, a couple of lines below grid.sync(); there is the following statement:

To guarantee the co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered.

Does it mean that if I launch so many threads that there is no co-residency of thread blocks, I can end up in a situation where threads deadlock?

The same question arises when I try to use coalesced_threads().sync(). Is the following correct?

coalesced_threads().sync();

waits until all active threads in the warp have reached this point and all global and shared memory accesses made by these threads prior to coalesced_threads().sync() are visible to all threads in the warp's list of active threads.

Does the following example exit from the while loops?

auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0) {
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
}
if (ct.thread_rank() == 1) {
    while (b == 0) {
        // what if a thread with rank 1 never executed?
        b = 1;
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
}

To make the example above clearer: without ct.sync() it is unsafe and can deadlock (loop infinitely):

auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0) {
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
    }
}
if (ct.thread_rank() == 1) {
    while (b == 0) {
        // what if a thread with rank 1 never executed?
        b = 1;
    }
}
hpd

2 Answers


So, my question is this correct:

this_grid().sync();

waits until all threads in the grid have reached this point and all global and shared memory accesses made by these threads prior to this_grid().sync() are visible to all threads in the grid.

Yes, that is correct, assuming you have a proper cooperative launch. A proper cooperative launch implies a number of things:

  1. the cooperative launch property is true on the GPU you are running on
  2. you have launched using a properly formed cooperative launch
  3. you have met grid sizing requirements for a cooperative launch
  4. after the cooperative launch, cudaGetLastError() returns cudaSuccess
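
The four requirements above can be sketched in host code roughly as follows; this is an illustrative outline, not taken from the answer, and the kernel name `myKernel` and its argument are placeholders:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void myKernel(int *data) {
    cg::grid_group grid = cg::this_grid();
    // ... phase 1 work ...
    grid.sync();  // grid-wide barrier
    // ... phase 2 work ...
}

int main() {
    int dev = 0, supportsCoopLaunch = 0;
    // 1. check the cooperative launch property on this device
    cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
    if (!supportsCoopLaunch) return 1;

    // 3. size the grid so all blocks can be co-resident, via the occupancy API
    int numBlocksPerSm = 0, numThreads = 128;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, myKernel,
                                                  numThreads, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    dim3 gridDim(numBlocksPerSm * prop.multiProcessorCount), blockDim(numThreads);

    // 2. launch with cudaLaunchCooperativeKernel, not the <<<...>>> syntax
    int *d_data = nullptr;
    cudaMalloc(&d_data, sizeof(int));
    void *args[] = { &d_data };
    cudaLaunchCooperativeKernel((void *)myKernel, gridDim, blockDim, args);

    // 4. confirm the launch itself succeeded
    if (cudaGetLastError() != cudaSuccess) return 1;
    cudaDeviceSynchronize();
    return 0;
}
```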

Does it mean that if I use so many threads so that there is no co-residency of thread blocks

If you violate the requirements for a cooperative launch, you are exploring undefined behavior. There is no point trying to definitively answer such questions, except to say that the behavior is undefined.

Regarding your statement(s) about coalesced threads, they are correct, although the wording must be understood carefully: the active threads for a particular instruction are the same as the coalesced threads.

In your example, you are creating an illegal case:

auto ct = coalesced_threads();
assert(ct.size() == 2); // there are exactly 2 threads in group ct
b = 0; // shared between all threads
if (ct.thread_rank() == 0) {
    // this means that only the thread whose rank is zero can participate in
    // what follows - by definition you have excluded 1 thread
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
        // it is illegal to request a synchronization of a group of threads when
        // your conditional code prevents one or more threads in the group from
        // participating
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
}

two different .sync() statements, in different places in the code, cannot satisfy the requirements of a single sync barrier. They each represent an individual barrier, whose requirements must be properly met.

Due to the illegal coding, this example also has undefined behavior; the same comments apply.
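
By contrast, a legal version keeps every member of ct on the same execution path to each barrier. A minimal sketch, assuming (as in the question) exactly two coalesced threads and a variable b visible to both; this rewrite is illustrative and not part of the original answer:

```cuda
#include <cooperative_groups.h>
#include <cassert>
namespace cg = cooperative_groups;

__device__ int b; // visible to both threads, as in the question's example

__global__ void legal_handoff() {
    auto ct = cg::coalesced_threads();
    assert(ct.size() == 2);
    if (ct.thread_rank() == 0)
        b = 0;     // only rank 0 writes, but no thread is excluded from the barrier
    ct.sync();     // every member of ct reaches this same barrier: legal
    if (ct.thread_rank() == 1)
        b = 1;
    ct.sync();     // again, the whole group participates
    // past this point, every thread in ct observes b == 1
}
```

The key design point: the conditionals guard only the writes, never the ct.sync() calls, so each barrier is reached by the entire group.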

Robert Crovella
  • Could you provide an example of violating "3. you have met grid sizing requirements for a cooperative launch"? Provided 1, 2, and 4 are satisfied, how can I violate 3? Also, does "no co-residency of thread blocks" always violate some of these 4 criteria, or is it possible to have no co-residency of thread blocks and still satisfy all 4 criteria? –  Nov 29 '19 at 21:27
  • I didn't say anywhere that if you have 1,2,4 satisfied, that you could still violate 3. – Robert Crovella Nov 29 '19 at 21:31
  • Is the statement "two different .sync() statements, in different places in the code, cannot satisfy the requirements of a single sync barrier" mentioned somewhere in the CUDA docs? I have no doubt it is true, but it would be great to read more details about these semantics to be able to answer such questions later. –  Nov 29 '19 at 21:33
  • I am asking about violating only 3 because I didn't grasp its meaning and what cases it covers. What exactly are the "grid sizing requirements"? Do we have such requirements? –  Nov 29 '19 at 21:35
  • Yes, there are grid sizing requirements. See [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg). Note that one-block per SM is an acceptable sizing strategy, and a maximal sizing strategy typically depends on use of the occupancy API. You may also wish to refer to any of the CUDA sample codes that use cooperative launch. – Robert Crovella Nov 29 '19 at 21:46
  • Regarding the barrier question, refer to the generalized statement given for `__syncthreads()` [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions): "`__syncthreads()` is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, " Does that describe your example? It does not. In your case, for a specific `ct.sync()` statement, the conditional code surrounding that statement does not evaluate identically across the `ct` group. – Robert Crovella Nov 29 '19 at 21:49
  • Thank you for such detailed answers! Unfortunately, the "grid sizing requirements" still seem not clearly defined in the docs. I see the sentence "To guarantee co-residency...", but does it mean that I **must** guarantee co-residency to satisfy the grid sizing requirements, or is it optional? –  Nov 29 '19 at 22:06
  • You must satisfy the grid sizing requirements. They are not optional. The goal is to guarantee co-residency, which means the number of blocks you launch in your cooperative launch must all be able to fit **simultaneously** on the SMs of your GPU. If you fail to meet that condition, you have performed an illegal cooperative launch. You can meet that condition using the occupancy API, as indicated in the docs and cuda sample codes. – Robert Crovella Nov 29 '19 at 22:11

Does it mean that if I use so many threads so that there is no co-residency of thread blocks

In this case, the launch will fail with the following error message:

too many blocks in cooperative launch
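
For illustration, such a failure can be detected host-side with a standard error check; the kernel name `myKernel`, its argument, and the oversized block count below are hypothetical:

```cuda
// Deliberately request far more blocks than can be co-resident on the GPU.
int *d_data = nullptr;
void *args[] = { &d_data };
cudaLaunchCooperativeKernel((void *)myKernel, dim3(1 << 20), dim3(128), args);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("%s\n", cudaGetErrorString(err));
// cudaErrorCooperativeLaunchTooLarge maps to the string
// "too many blocks in cooperative launch"
```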
hpd