I'm using CUDA 6.5 + VS2013 + a GTX Titan Black. I observe that the following printing code crashes when the total number of threads is larger than 65536. I googled a bit but haven't found anything useful. Has anyone else observed the same behaviour? Or can anyone provide an explanation? Thank you very much!
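#include <cstdio>

__global__ void testKernel(int val)
{
    // Global 2D coordinates of this thread
    int X = blockDim.x * blockIdx.x + threadIdx.x;
    int Y = blockDim.y * blockIdx.y + threadIdx.y;
    printf("[%d, %d]:\t\tValue is:%d\n", X, Y, val);
}

int main()
{
    dim3 block(16, 16);
    dim3 grid(16, 16);   // 256 blocks * 256 threads per block = 65536 threads
    testKernel<<<grid, block>>>(10);
    cudaDeviceSynchronize();
    // Pick up any error left by the launch or the synchronize call
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "%s\n", cudaGetErrorString(err));
    cudaDeviceReset();
    return 0;
}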
I get the following error message when I use block(32,16) and grid(16,16):
Gpu API call (the launch timed out and was terminated)...
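
The quoted text matches the runtime's error string for cudaErrorLaunchTimeout, which is reported when a kernel runs longer than the display driver's watchdog allows. One plausible reading, not confirmed here, is that printing from 131072 threads simply takes long enough to hit that limit. The following is a minimal sketch for narrowing this down: it queries whether a run time limit applies to device 0 and checks the launch and the synchronize call separately. CHECK_CUDA and totalThreads are hypothetical helpers introduced for illustration, not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): print a CUDA error with its location.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
        }                                                           \
    } while (0)

// Hypothetical helper: total threads = blocks in the grid * threads per block.
static unsigned long long totalThreads(dim3 grid, dim3 block)
{
    return 1ULL * grid.x * grid.y * grid.z * block.x * block.y * block.z;
}

__global__ void testKernel(int val)
{
    int X = blockDim.x * blockIdx.x + threadIdx.x;
    int Y = blockDim.y * blockIdx.y + threadIdx.y;
    printf("[%d, %d]:\t\tValue is:%d\n", X, Y, val);
}

int main()
{
    // Nonzero means kernels on device 0 are subject to a run time limit.
    int timeoutEnabled = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&timeoutEnabled,
                                      cudaDevAttrKernelExecTimeout, 0));
    printf("kernel run time limit enabled: %d\n", timeoutEnabled);

    dim3 block(32, 16);   // the failing configuration from the question
    dim3 grid(16, 16);
    printf("total threads: %llu\n", totalThreads(grid, block));

    testKernel<<<grid, block>>>(10);
    CHECK_CUDA(cudaGetLastError());        // errors in the launch configuration
    CHECK_CUDA(cudaDeviceSynchronize());   // errors raised while the kernel ran
    CHECK_CUDA(cudaDeviceReset());
    return 0;
}
```

If the limit is reported as enabled and the error only appears at cudaDeviceSynchronize, that would be consistent with the kernel being terminated for running too long rather than with a fixed cap on the number of threads.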