I'm experimenting with the new Dynamic Parallelism feature in CUDA 5.0 (GTK 110). I face the strange behavior that my program does not return the expected result for some configurations—not only unexpected, but also a different result with each launch.
Now I think I found the source of my problem: It seems that some child girds (kernels launched by other kernels) are sometimes not executed when too many child grids are spawned at the same time.
I wrote a little test program to illustrate this behavior:
#include <stdio.h>
__global__ void out_kernel(char* d_out, int index)
{
d_out[index] = 1;
}
__global__ void kernel(char* d_out)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
out_kernel<<<1, 1>>>(d_out, index);
}
int main(int argc, char** argv) {
int griddim = 10, blockdim = 210;
// optional: read griddim and blockdim from command line
if(argc > 1) griddim = atoi(argv[1]);
if(argc > 2) blockdim = atoi(argv[2]);
const int numLaunches = griddim * blockdim;
const int memsize = numLaunches * sizeof(char);
// allocate device memory, set to 0
char* d_out; cudaMalloc(&d_out, memsize);
cudaMemset(d_out, 0, memsize);
// launch outer kernel
kernel<<<griddim, blockdim>>>(d_out);
cudaDeviceSynchronize();
// dowload results
char* h_out = new char[numLaunches];
cudaMemcpy(h_out, d_out, memsize, cudaMemcpyDeviceToHost);
// check results, reduce output to 10 errors
int maxErrors = 10;
for (int i = 0; i < numLaunches; ++i) {
if (h_out[i] != 1) {
printf("Value at index %d is %d, should be 1.\n", i, h_out[i]);
if(maxErrors-- == 0) break;
}
}
// clean up
delete[] h_out;
cudaFree(d_out);
cudaDeviceReset();
return maxErrors < 10 ? 1 : 0;
}
The program launches a kernel in a given number of blocks (1st parameter) with a given number of threads each (2nd parameter). Each thread in that kernel will then launch another kernel with a single thread. This child kernel will write a 1 in its portion of an output array (which was initialized with 0s).
At the end of execution all values in the output array should be 1. But strangely for some block- and grid-sizes some of the array values are still zero. This basically means that some of the child grids are not executed.
This only happens if many of the child grids are spawned at the same time. On my test system (a Tesla K20x) this is the case for 10 blocks containing 210 threads each. 10 blocks with 200 threads deliver the correct result, though. But also 3 blocks with 1024 threads each cause the error. Strangely, no error is reported back by the runtime. The child grids simply seem to be ignored by the scheduler.
Does anyone else face the same problem? Is this behavior documented somewhere (I did not find anything), or is it really a bug in the device runtime?