Some child grids not being executed with CUDA Dynamic Parallelism

Question

I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GTK 110). I face the strange behavior that my program does not return the expected result for some configurations—not only unexpected, but also a different result with each launch.

Now I think I found the source of my problem: It seems that some child girds (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time.

I wrote a little test program to illustrate this behavior:

#include <stdio.h>

__global__ void out_kernel(char* d_out, int index)
{
    d_out[index] = 1;
}

__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
}

int main(int argc, char** argv) {

    int griddim = 10, blockdim = 210;
    // optional: read griddim and blockdim from command line
    if(argc > 1) griddim = atoi(argv[1]);
    if(argc > 2) blockdim = atoi(argv[2]);

    const int numLaunches = griddim * blockdim;
    const int memsize = numLaunches * sizeof(char);

    // allocate device memory, set to 0
    char* d_out; cudaMalloc(&d_out, memsize);
    cudaMemset(d_out, 0, memsize);

    // launch outer kernel
    kernel<<<griddim, blockdim>>>(d_out);
    cudaDeviceSynchronize();

    // dowload results
    char* h_out = new char[numLaunches];
    cudaMemcpy(h_out, d_out, memsize, cudaMemcpyDeviceToHost);

    // check results, reduce output to 10 errors
    int maxErrors = 10;
    for (int i = 0; i < numLaunches; ++i) {
        if (h_out[i] != 1) {
            printf("Value at index %d is %d, should be 1.\n", i, h_out[i]);
            if(maxErrors-- == 0) break;
        }
    }

    // clean up
    delete[] h_out;
    cudaFree(d_out);
    cudaDeviceReset();
    return maxErrors < 10 ? 1 : 0;
}

The program launches a kernel in a given number of blocks (1st parameter) with a given number of threads each (2nd parameter). Each thread in that kernel will then launch another kernel with a single thread. This child kernel will write a 1 in its portion of an output array (which was initialized with 0s).

At the end of execution all values in the output array should be 1. But strangely for some block- and grid-sizes some of the array values are still zero. This basically means that some of the child grids are not executed.

This only happens if many of the child grids are spawned at the same time. On my test system (a Tesla K20x) this is the case for 10 blocks containing 210 threads each. 10 blocks with 200 threads deliver the correct result, though. But also 3 blocks with 1024 threads each cause the error. Strangely, no error is reported back by the runtime. The child grids simply seem to be ignored by the scheduler.

Does anyone else face the same problem? Is this behavior documented somewhere (I did not find anything), or is it really a bug in the device runtime?

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

You're doing no error checking of any kind that I can see. You can and should do similar error checking on device kernel launches. Refer to the documentation These errors will not necessarily be bubbled up to the host:

Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated.

You must trap them in the device. There are plenty of examples of this type of device error checking in the documentation.

If you were to do proper error checking you would discover that in each case where a kernel failed to launch, the cuda device runtime API was returning error 69, cudaErrorLaunchPendingCountExceeded.

If you scan the documentation for this error, you'll find this:

cudaLimitDevRuntimePendingLaunchCount

Controls the amount of memory set aside for buffering kernel launches which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, launches will set the thread’s last error to cudaErrorLaunchPendingCountExceeded. The default pending launch count is 2048 launches.

At 10 blocks * 200 threads, you are launching 2000 kernels, and things seem to work.

At 10 blocks * 210 threads, you are launching 2100 kernels, which exceeds the 2048 limit mentioned above.

Note that this is somewhat dynamic in nature; depending on how your application launches child kernels, you may launch in excess of 2048 kernels easily without hitting this limit. But since your application launches all kernels approximately simultaneously, you are hitting the limit.

Proper cuda error checking is advisable any time your CUDA code is not behaving the way you expect.

If you'd like to get some confirmation of the above, in your code you can modify your main kernel like this:

__global__ void kernel(char* d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    out_kernel<<<1, 1>>>(d_out, index);
//    cudaDeviceSynchronize();  // not necessary since error 69 is returned immediately
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) d_out[index] = (char)err;
}

The pending launch count limit is modifiable. Refer to the documentation for cudaLimitDevRuntimePendingLaunchCount

That makes perfect sense, thanks for the answer! I did not know that one can use `cudaGetLastError()` _inside_ a kernel. I also found that it is possible to increase the pending launch count using `cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, )`. It would be great if you could add that to your answer. Thanks again! — Frank Rupprecht, Jul 28 '13 at 11:16

Some child grids not being executed with CUDA Dynamic Parallelism

1 Answers1

Linked