
Example of dynamic parallelism:

#include <cstdio>

__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid, blockIdx.x);
    // condition to stop recursive execution
    if (iSize == 1) return;
    // reduce block size to half
    int nthreads = iSize >> 1;
    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0) {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}

This is the output with one block (with two blocks, the entire parent grid finishes printing before the child grids do):

./nestedHelloWorld Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
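
For reference, a minimal host-side driver that could produce output like the above might look as follows (the argument handling is an assumption on my part; dynamic parallelism also requires compiling with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt):

#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int igrid = 1, iblock = 8;
    if (argc > 1) igrid  = atoi(argv[1]);
    if (argc > 2) iblock = atoi(argv[2]);
    printf("%s Execution Configuration: grid %d block %d\n", argv[0], igrid, iblock);
    // launch the recursive kernel at depth 0
    nestedHelloWorld<<<igrid, iblock>>>(iblock, 0);
    // wait for the parent grid and all nested child grids to finish
    cudaDeviceSynchronize();
    return 0;
}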

Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well?

If so, how does this work? All I'm reading is that a parent grid is not technically finished before its child grids finish. I see nothing about guarantees for other parent threads that have not launched children.

John

2 Answers


Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well?

No. You can make no assumptions about the state of other threads in the parent block or other blocks in the parent grid.
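
If such a guarantee is needed, it must be created explicitly, and the strongest one available is block-scoped. As a minimal sketch (launchAfterBarrier is a hypothetical name, reusing nestedHelloWorld from the question), a __syncthreads() barrier before the launch guarantees that all threads in the launching block, and only that block, have reached the barrier:

__global__ void launchAfterBarrier(int const iSize, int iDepth) {
    // ... per-thread work ...
    __syncthreads();  // every thread in THIS block has reached this point
    if (threadIdx.x == 0 && iSize > 1) {
        // guaranteed: this block's threads finished the work above the barrier
        // NOT guaranteed: anything about other blocks in the parent grid
        int nthreads = iSize >> 1;
        nestedHelloWorld<<<1, nthreads>>>(nthreads, iDepth + 1);
    }
}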

If so, how does this work? All I'm reading is that a parent grid is not technically finished before its child grids finish. I see nothing about guarantees for other parent threads that have not launched children.

When a parent thread launches a child grid, it pushes work to the GPU at a higher priority than its own. On compute capability 3.5 - 5.x devices the GPU will schedule the highest-priority work, but it will not pre-empt any running blocks. If the GPU is full, the compute work distributor will not be able to schedule the child blocks. As parent blocks complete, the child blocks will be distributed before any new parent blocks. At this point the design could still deadlock: if the block that launched the work performs a join operation (cudaDeviceSynchronize) and the child work has not completed, whether because there was not sufficient room to schedule it or because it is still running, then the parent block (not the whole grid) will pre-empt itself. This allows the child grid to make forward progress. The CDP scheduler restores the parent block when the child grid has completed.

The parent grid will not be marked as completed until all blocks from the parent complete and all child grids complete.

  • If the parent grid launches a child grid but does not join, it is possible that all parent blocks complete before the child blocks are scheduled.
  • If the parent grid joins, then it is likely that all child grids complete before the parent blocks complete.
  • If the parent grid launches more blocks than the GPU can execute concurrently, then the behavior falls somewhere in between.
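
To make the fork and the join concrete, here is a minimal sketch (childKernel and parentKernel are illustrative names, not code from this answer; device-side cudaDeviceSynchronize() was the CDP join primitive on compute capability 3.5+ and has since been deprecated in newer CUDA versions):

__global__ void childKernel(int *data) {
    data[threadIdx.x] *= 2;             // the child grid's work
}

__global__ void parentKernel(int *data) {
    if (threadIdx.x == 0) {
        childKernel<<<1, 32>>>(data);   // fork: child grid queued at higher priority
        cudaDeviceSynchronize();        // join: this parent block may pre-empt itself
                                        // until the child grid has completed
    }
    __syncthreads();                    // make the join visible to the rest of the block
    // ... the whole block can now safely consume data[] ...
}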

Nsight VSE CUDA Trace and Visual Profiler have additional visualizers for tracing CDP grids. The video (but not the slides) from the GTC 2013 presentation Profiling and Optimizing CUDA Kernel Code with NVIDIA Nsight Visual Studio Edition provides the best documentation on CDP visualization. Start watching at 17:15.

Greg Smith
  • Thanks for the detailed explanation. Could you elaborate on what it means for a parent grid to "join" a child grid and what code is invoked for this to happen? I read somewhere that it is implicit if not made explicit but everything still wasn't really clear. – John Jun 12 '15 at 02:13
  • [Fork-join](https://en.wikipedia.org/wiki/Fork%E2%80%93join_model) is a common model in parallel programming. In CDP a fork is a child launch and a join is a synchronization primitive on children. In the case above cudaDeviceSynchronize() can be used to wait for the children work to complete. – Greg Smith Jun 18 '15 at 00:48

No. All threads in a warp execute in lock-step, so if thread 0 has not yet finished, neither have threads [1..31]. Other threads (or warps) in the block may or may not have finished executing.

Avi Ginsburg
  • I understand this. But when I increase it to 2 blocks or more, both print before the child does. Is that just random? See here: http://www.turkpaylasim.com/cevahir/2015/04/20/cudada-kernel-icinde-kernel/ – John Jun 11 '15 at 11:37
  • My Turkish isn't great, but from what I gather, you have the same output as a single block from each block interleaved. Within each "root" block the threads behave as expected. Within each child block (all have an id of `0`) the threads are still in lock-step. – Avi Ginsburg Jun 11 '15 at 11:46
  • Haha, I can't speak Turkish either; I just wanted to show the example for 2 blocks. I'm not sure I understand what you're saying correctly. Does this mean that the entire parent grid will finish executing before the child grid is launched? Based on what you said, no it doesn't. Can I expect that block 0 from the root may be executed after child 0 from block 1? – John Jun 11 '15 at 11:53
  • I should quote [talonmies](http://stackoverflow.com/a/10587358/2899559) and say "You should also be aware that the compiler and assembler do perform instruction re-ordering...", plus the fact that I wouldn't trust the order of `printf`s between threads. – Avi Ginsburg Jun 11 '15 at 13:08