I'm using CUDA 6.5, VS2013, and a GTX Titan Black. The following code, which prints from every thread, crashes when the total number of threads exceeds 65536. I googled a bit but haven't found anything useful. Has anyone else observed the same behaviour? Or can anyone explain it? Thank you very much!

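#include <cstdio>
#include <cuda_runtime.h>

__global__ void testKernel(int val)
{
    // Global 2D coordinates of this thread
    int X = blockDim.x * blockIdx.x + threadIdx.x;
    int Y = blockDim.y * blockIdx.y + threadIdx.y;
    printf("[%d, %d]:\t\tValue is:%d\n", X, Y, val);
}

int main()
{
    dim3 block(16, 16);
    dim3 grid(16, 16);   // 256 blocks x 256 threads = 65536 threads
    testKernel<<<grid, block>>>(10);
    cudaDeviceSynchronize();   // waits for the kernel; device printf output is flushed here
    cudaGetLastError();

    cudaDeviceReset();
    return 0;
}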

And I got the following error message when I use block(32,16) and grid(16,16):

Gpu API call (the launch timed out and was terminated)...

Shuda Li
  • I wasn't able to reproduce the failure. However, you might want to know that the in-kernel `printf` feature has significant limits [that you may want to read about](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#limitations). It's really not designed for large-scale output for a variety of reasons. One in particular is that the buffer for this activity is limited, and when it overflows, the previous buffer data will be lost (i.e. not printed out). Also, rather than saying "will crash", you might be specific about the crash, e.g. paste the crash text into your question. – Robert Crovella Aug 18 '14 at 14:32
  • When I run your code, even though you would expect 65536 lines of output, I get only 4096 lines of output, due to the buffer limitation. It's possible the crash you are running into is a Windows TDR timeout, caused by the kernel taking a very long time because of the `printf` from every thread. – Robert Crovella Aug 18 '14 at 14:37
  • Thanks Robert! What is the number of threads you've tried? Can you try block(32,32) grid(32,32)? – Shuda Li Aug 18 '14 at 14:40
  • Yes, that kernel takes a very long time and would definitely hit the default Windows TDR timeout. I'm not sure what you googled for, but if you google "launch timed out and was terminated" you'll get plenty of instructive hits, like [this one](http://stackoverflow.com/questions/13525530/the-launch-timed-out-and-was-terminated) – Robert Crovella Aug 18 '14 at 14:57

2 Answers

Your kernel is taking too long to execute:

the launch timed out and was terminated

This is a limitation of the Windows operating system when running on WDDM devices: the Timeout Detection and Recovery (TDR) watchdog resets the GPU when a kernel runs for too long.

There are a variety of workarounds possible. Some are:

  1. reduce your kernel execution time
  2. switch the GPU to TCC mode, if possible (not possible with GeForce GPUs)
  3. extend the TDR timeout delay (or remove it) via Windows registry modification

Also, the in-kernel printf feature has significant limits. It's really not designed for large-scale output for a variety of reasons. One in particular is that the buffer for this activity is limited, and when it overflows, the previous buffer data will be lost (i.e. not printed out).
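As an illustration of the first workaround, here is a minimal sketch that shortens the kernel by printing from only a sample of the threads, and that reports the error returned by the runtime instead of discarding it. The names `isSampledThread` and `testKernelSampled` and the stride of 64 are hypothetical choices for this example, not part of the CUDA API or of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true only when both coordinates are multiples of `stride`.
__host__ __device__ inline bool isSampledThread(int x, int y, int stride)
{
    return (x % stride == 0) && (y % stride == 0);
}

// Same kernel as in the question, but only sampled threads call printf.
__global__ void testKernelSampled(int val, int stride)
{
    int X = blockDim.x * blockIdx.x + threadIdx.x;
    int Y = blockDim.y * blockIdx.y + threadIdx.y;
    if (isSampledThread(X, Y, stride)) {
        printf("[%d, %d]:\tValue is:%d\n", X, Y, val);
    }
}

int main()
{
    dim3 block(32, 16);
    dim3 grid(16, 16);   // 512 x 256 = 131072 threads in total
    testKernelSampled<<<grid, block>>>(10, 64);

    // Check the launch first, then the execution itself.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceReset();
    return 0;
}
```

With a stride of 64 on this 512 x 256 thread layout, only 8 x 4 = 32 threads print, so far less output goes through the printf buffer. If the kernel still exceeds the TDR limit, the message from `cudaGetErrorString` should make that visible.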

Robert Crovella
  • For how to adjust TDR timeout on windows see http://stackoverflow.com/a/17187135/3242721 – Michal Hosala Aug 18 '14 at 14:55
  • If you look at the page I linked, you'll find [this Microsoft description of TDR with instructions for modification](http://stackoverflow.com/questions/13525530/the-launch-timed-out-and-was-terminated) – Robert Crovella Aug 18 '14 at 14:58
  • The reason for the long time of executing the cuda kernels is probably the printing buffer overflow. – Shuda Li Aug 18 '14 at 15:06

Thanks to Robert's answer, I realized that the problem might be due to the size of the buffer. Using the following code, I found that the default size of the printf buffer is 1048576 bytes (1 MiB):

size_t sz;
cudaDeviceGetLimit(&sz, cudaLimitPrintfFifoSize);
std::cout << sz << std::endl;

When I increase the buffer size to 100 MiB using the following code, the error disappears and I get all of the expected output, 131072 lines in total (with block(32,16) and grid(16,16)):

sz = 1048576 * 100;  // 100 MiB
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, sz);  // call this before launching the kernel

Somehow, overflowing the printf buffer makes the kernel take longer than usual and triggers a TDR. With a sufficiently large buffer, the code manages to finish before the timeout. More importantly, a sufficiently large buffer ensures that no output is lost.

However, I think the upper bound on the buffer size and the execution time depend on the device. That it works on a Titan Black does not necessarily mean it will also work on other NVIDIA cards. Again, I agree with Robert that using printf to export large amounts of data from CUDA kernels is unreliable in practice. I only use it to dump some information when debugging a kernel.
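Putting the pieces above together, a minimal sketch of the whole program might look like the following. The `CHECK_CUDA` macro and the `mib` helper are illustrative names, not part of the CUDA API, and the 100 MiB figure is just the value reported above for the Titan Black; other devices may need a different size.

```cuda
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a runtime call fails.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::cerr << "CUDA error: " << cudaGetErrorString(err_)   \
                      << " at line " << __LINE__ << std::endl;        \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Illustrative helper: convert mebibytes to bytes.
inline size_t mib(size_t n) { return n * 1024 * 1024; }

__global__ void testKernel(int val)
{
    int X = blockDim.x * blockIdx.x + threadIdx.x;
    int Y = blockDim.y * blockIdx.y + threadIdx.y;
    printf("[%d, %d]:\tValue is:%d\n", X, Y, val);
}

int main()
{
    // Query the current printf buffer size.
    size_t sz = 0;
    CHECK_CUDA(cudaDeviceGetLimit(&sz, cudaLimitPrintfFifoSize));
    std::cout << "printf FIFO size before: " << sz << " bytes" << std::endl;

    // Enlarge the buffer before any kernel that uses printf is launched.
    CHECK_CUDA(cudaDeviceSetLimit(cudaLimitPrintfFifoSize, mib(100)));
    CHECK_CUDA(cudaDeviceGetLimit(&sz, cudaLimitPrintfFifoSize));
    std::cout << "printf FIFO size after: " << sz << " bytes" << std::endl;

    dim3 block(32, 16);
    dim3 grid(16, 16);   // 131072 threads, one printf each
    testKernel<<<grid, block>>>(10);
    CHECK_CUDA(cudaGetLastError());        // launch errors
    CHECK_CUDA(cudaDeviceSynchronize());   // execution errors, e.g. a TDR timeout

    CHECK_CUDA(cudaDeviceReset());
    return 0;
}
```

Checking the return value of every call means a timeout shows up as an explicit error message rather than as silently missing output.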

WilliamKF
Shuda Li
  • I don't think this will make the TDR issue go away for larger kernel sizes, like the block(32,32), grid(32,32) you asked about. So it does improve things somewhat, but it is only pushing out the boundary of failure a small amount. I wouldn't suggest using in-kernel `printf` this way. – Robert Crovella Aug 18 '14 at 15:06
  • When I increase the printing buffer size to 1024M, block(32,32) and grid(32,32) work for me as well. But I agree it's not a good idea to use printf this way for any practical purposes. For my case, I just use printf for debugging. – Shuda Li Aug 18 '14 at 15:16
  • Thanks so much! This solved my issue. Thought I was going crazy for a minute. – thatWiseGuy Feb 26 '17 at 01:16