In the Cooperative Groups section of the CUDA Programming Guide, there is an example of grid-wide synchronization:
grid_group grid = this_grid();
grid.sync();
Unfortunately, I couldn't find a precise definition of the behavior of grid.sync(). Is it correct to take the following definition given for __syncthreads and extend it to the grid level?
void __syncthreads();
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
So my question is: is the following correct?
this_grid().sync();
waits until all threads in the grid have reached this point and all global and shared memory accesses made by these threads prior to this_grid().sync() are visible to all threads in the grid.
I doubt this is correct because, a couple of lines below grid.sync(); in the CUDA Programming Guide, there is the following statement:
To guarantee the co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered.
Does this mean that if I launch so many threads that the thread blocks cannot all be co-resident, the threads can deadlock?
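For context, this is how I currently understand the guide's advice on sizing the launch. It is a minimal sketch, not code from the guide: the kernel name grid_sync_demo, the buffers in and out, and the block size of 128 are my own placeholders, and I am assuming a device that reports cooperativeLaunch support. The grid size is derived from cudaOccupancyMaxActiveBlocksPerMultiprocessor so that all blocks can be resident at once, and the kernel is started with cudaLaunchCooperativeKernel.

```cuda
// Sketch only. Assumed build line: nvcc -arch=sm_70 -rdc=true grid_sync.cu
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Hypothetical kernel: every thread writes one element, the whole grid
// synchronizes, then every thread reads an element written by another thread.
__global__ void grid_sync_demo(int *in, int *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = static_cast<int>(grid.thread_rank());
    if (i < n) in[i] = i;
    grid.sync();  // every thread of the grid reaches this call
    if (i < n) out[i] = in[n - 1 - i];
}

int main() {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    if (!prop.cooperativeLaunch) {
        printf("cooperative launch is not supported on this device\n");
        return 0;
    }

    // Size the grid so that all blocks can be co-resident:
    // (resident blocks per SM for this kernel) * (number of SMs).
    const int threads = 128;  // assumed block size
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  grid_sync_demo, threads, 0);
    int blocks = blocks_per_sm * prop.multiProcessorCount;
    int n = blocks * threads;

    int *in = nullptr;
    int *out = nullptr;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    // grid.sync() requires a cooperative launch instead of <<<...>>>.
    void *args[] = { &in, &out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)grid_sync_demo, dim3(blocks), dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("launch with %d blocks: %s\n", blocks, cudaGetErrorString(err));

    if (err == cudaSuccess) {
        int first = -1;
        cudaMemcpy(&first, out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("out[0] = %d (written by the last thread as %d)\n", first, n - 1);
    }

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

My understanding, which I would like confirmed, is that requesting more blocks than this limit makes the cooperative launch return an error rather than start a grid that could hang.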
The same question arises when I try to use coalesced_threads().sync(). Is the following correct?
coalesced_threads().sync();
waits until all active threads in the warp have reached this point and all global and shared memory accesses made by these threads prior to coalesced_threads().sync() are visible to all active threads in the warp.
Does the following example ever exit its while loops?
auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0) {
    while (b == 0) {
        // what if only the rank 0 branch is ever scheduled because of thread divergence?
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
}
if (ct.thread_rank() == 1) {
    while (b == 0) {
        // what if the thread with rank 1 never gets to execute?
        b = 1;
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
}
To make the example above clear: without ct.sync(), it is unsafe and can deadlock (loop forever):
auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0) {
    while (b == 0) {
        // what if only the rank 0 branch is ever scheduled because of thread divergence?
    }
}
if (ct.thread_rank() == 1) {
    while (b == 0) {
        // what if the thread with rank 1 never gets to execute?
        b = 1;
    }
}