
Looking through the answers and comments on CUDA questions, and in the CUDA tag wiki, I see it is often suggested that the return status of every API call should be checked for errors. The API documentation covers functions like cudaGetLastError, cudaPeekAtLastError, and cudaGetErrorString, but what is the best way to put these together to reliably catch and report errors without requiring lots of extra code?

talonmies
  • NVIDIA's CUDA [samples](http://docs.nvidia.com/cuda/cuda-samples/) contain a header, helper_cuda.h, that has macros called `getLastCudaError` and `checkCudaErrors`, which do pretty much what is described in [the accepted answer](http://stackoverflow.com/a/14038590/2778484). See the samples for demonstrations. Just choose to [install the samples along with the toolkit](https://developer.nvidia.com/cuda-downloads) and you will have it. – chappjc Sep 03 '14 at 01:16
  • @chappjc I do not think this question and answer pretend to be original, if this is what you mean, but they have the merit of having educated people in using CUDA error checking. – Vitality Feb 18 '15 at 17:51
  • @JackOLantern No, that's not what I was implying. This Q&A was very helpful to me and it's certainly easier to find than some header in the SDK. I thought it was valuable to point out this is also how NVIDIA handles it and where to look for more. I'd soften the tone of my comment if I could though. :) – chappjc Feb 18 '15 at 18:09
  • Debugging tools allowing you to "approach" where the errors start have improved a great deal since 2012 on CUDA. I have not worked with GUI based debuggers but the [CUDA tag wiki](http://stackoverflow.com/tags/cuda/info) mentions the command line cuda-gdb. This is a VERY powerful tool as it allows you to step through actual warps and threads on the GPU itself (requires 2.0+ architecture most of the time though) – opetrenko Jan 09 '16 at 17:58
  • @bluefeet: what was the deal with the edit that you rolled back? It looked like nothing actually changed in the markdown, but it was accepted as an edit. Was there something nefarious at work? – talonmies Mar 25 '17 at 18:09
  • @talonmies I won't comment on the nefarious nature, but [here are more details](https://meta.stackoverflow.com/questions/345750/invisible-revisions). – Taryn Mar 27 '17 at 14:59

5 Answers


Probably the best way to check for errors in runtime API code is to define an assert-style handler function and wrapper macro, like this:

#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

You can then wrap each API call with the gpuErrchk macro, which will process the return status of the API call it wraps, for example:

gpuErrchk( cudaMalloc(&a_d, size*sizeof(int)) );

If there is an error in a call, a textual message describing the error and the file and line in your code where the error occurred will be emitted to stderr and the application will exit. You could conceivably modify gpuAssert to raise an exception rather than call exit() in a more sophisticated application if it were required.
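For instance, here is a minimal sketch of such a throwing variant (the gpuAssertThrow name is an illustrative choice, not part of CUDA):

#include <stdexcept>
#include <string>

// Throws instead of exiting, so the caller can catch and recover.
inline void gpuAssertThrow(cudaError_t code, const char *file, int line)
{
   if (code != cudaSuccess)
      throw std::runtime_error(std::string("GPUassert: ") + cudaGetErrorString(code)
                               + " at " + file + ":" + std::to_string(line));
}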

A second related question is how to check for errors in kernel launches, which can't be directly wrapped in a macro call like standard runtime API calls. For kernels, something like this:

kernel<<<1,1>>>(a);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );

will first check for an invalid launch argument, then force the host to wait until the kernel stops, and then check for an execution error. The synchronisation can be eliminated if you have a subsequent blocking API call like this:

kernel<<<1,1>>>(a_d);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaMemcpy(a_h, a_d, size * sizeof(int), cudaMemcpyDeviceToHost) );

in which case the cudaMemcpy call can return either errors which occurred during the kernel execution or those from the memory copy itself. This can be confusing for the beginner, and I would recommend using explicit synchronisation after a kernel launch during debugging to make it easier to understand where problems might be arising.
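One way to get that stricter checking in debug builds without paying for it in release builds is a sketch like the following (the gpuKernelCheck name and the DEBUG guard are illustrative choices, not a standard CUDA facility):

// In debug builds, catch launch errors and execution errors immediately;
// in release builds, only catch launch errors and let later blocking
// calls surface execution errors.
#ifdef DEBUG
#define gpuKernelCheck() { gpuErrchk(cudaPeekAtLastError()); gpuErrchk(cudaDeviceSynchronize()); }
#else
#define gpuKernelCheck() { gpuErrchk(cudaPeekAtLastError()); }
#endif

kernel<<<1,1>>>(a_d);
gpuKernelCheck();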

Note that when using CUDA Dynamic Parallelism, a very similar methodology can and should be applied to any usage of the CUDA runtime API in device kernels, as well as after any device kernel launches:

#include <assert.h>
#include <cstdio>  // for in-kernel printf
#define cdpErrchk(ans) { cdpAssert((ans), __FILE__, __LINE__); }
__device__ void cdpAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      printf("GPU kernel assert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) assert(0);
   }
}

CUDA Fortran error checking is analogous. See here and here for typical function error return syntax. A method similar to CUDA C++ is used to collect errors related to kernel launches.

Robert Crovella
talonmies
  • Shouldn't we add `cudaDeviceReset()` before exiting also? And a clause for memory deallocation? – Aurelius Oct 21 '14 at 14:14
  • @talonmies: For async CUDA runtime calls, such as cudaMemsetAsync and cudaMemcpyAsync, does it also require synchronizing the GPU device and host thread via a call to gpuErrchk( cudaDeviceSynchronize() )? – nurabha May 26 '15 at 15:55
  • Note that explicit synchronization after kernel launch is not wrong but can severely alter execution performance and interleaving semantics. If you are using interleaving, doing explicit synchronization for debugging might hide a whole class of bugs which might be hard to track down in the Release build. – masterxilo Apr 07 '16 at 00:36
  • Is there any way to get more specific errors for kernel executions? All the errors I'm getting just give me the line number from the host code, not from the kernel. – Azmisov May 25 '17 at 02:15
  • Given the popularity of this answer, shouldn't someone put this macro in the CUDA API? I've seen this macro rolled into a billion repos now, each with small differences. – user14717 Apr 06 '19 at 15:50
  • Note that, unlike all other CUDA errors, kernel _launch_ errors will not be reported by subsequent synchronizing calls to the CUDA runtime API. Just putting `gpuErrchk()` around the next `cudaMemcpy()` or `cudaDeviceSynchronize()` call is thus insufficient to catch all possible error conditions. I'd argue it is better style to call `cudaGetLastError()` instead of `cudaPeekAtLastError()` immediately after a kernel launch, even though they have the same effect, to aid the unwitting reader. – tera Jul 17 '19 at 20:40
  • I personally favor goto-based error handling (the sample code in The CUDA Handbook follows this pattern, which is also used by the Linux kernel); but whether you do that or build something like the gpuErrchk macro described here, you can use a pattern developed by Allan MacKinnon: define a macro that looks just like a CUDA runtime invocation: `cuda(DeviceSynchronize());` (a sketch of this pattern follows below). – ArchaeaSoftware Oct 21 '20 at 16:36
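A hedged sketch of that token-pasting pattern (the cuda() macro and cuda_check_impl names are illustrative, not a standard API): cuda(X(args)) expands to a checked call of cudaX(args).

#include <cstdio>
#include <cstdlib>

inline void cuda_check_impl(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Token-pasting: cuda(DeviceSynchronize()) becomes
// cuda_check_impl((cudaDeviceSynchronize()), __FILE__, __LINE__).
#define cuda(call) cuda_check_impl((cuda##call), __FILE__, __LINE__)

// Usage: cuda(Malloc(&ptr, bytes));  cuda(DeviceSynchronize());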

talonmies' answer above is a fine way to abort an application in an assert-style manner.

Occasionally we may wish to report and recover from an error condition in a C++ context as part of a larger application.

Here's a reasonably terse way to do that by throwing a C++ exception derived from std::runtime_error using thrust::system_error:

#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>
#include <sstream>

void throw_on_cuda_error(cudaError_t code, const char *file, int line)
{
  if(code != cudaSuccess)
  {
    std::stringstream ss;
    ss << file << "(" << line << ")";
    std::string file_and_line;
    ss >> file_and_line;
    throw thrust::system_error(code, thrust::cuda_category(), file_and_line);
  }
}

This will incorporate the filename, line number, and an English language description of the cudaError_t into the thrown exception's .what() member:

#include <iostream>

int main()
{
  try
  {
    // do something crazy
    throw_on_cuda_error(cudaSetDevice(-1), __FILE__, __LINE__);
  }
  catch(thrust::system_error &e)
  {
    std::cerr << "CUDA error after cudaSetDevice: " << e.what() << std::endl;

    // oops, recover
    cudaSetDevice(0);
  }

  return 0;
}

The output:

$ nvcc exception.cu -run
CUDA error after cudaSetDevice: exception.cu(23): invalid device ordinal

A client of a function which uses throw_on_cuda_error internally (call it some_function) can distinguish CUDA errors from other kinds of errors if desired:

try
{
  // call some_function which may throw something
  some_function();
}
catch(thrust::system_error &e)
{
  std::cerr << "CUDA error during some_function: " << e.what() << std::endl;
}
catch(std::bad_alloc &e)
{
  std::cerr << "Bad memory allocation during some_function: " << e.what() << std::endl;
}
catch(std::runtime_error &e)
{
  std::cerr << "Runtime error during some_function: " << e.what() << std::endl;
}
catch(...)
{
  std::cerr << "Some other kind of error during some_function" << std::endl;

  // no idea what to do, so just rethrow the exception
  throw;
}

Because thrust::system_error is a std::runtime_error, we can alternatively handle it in the same manner as a broader class of errors if we don't require the precision of the previous example:

try
{
  // call some_function which may throw something
  some_function();
}
catch(std::runtime_error &e)
{
  std::cerr << "Runtime error during some_function: " << e.what() << std::endl;
}
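To avoid spelling out __FILE__ and __LINE__ at every call site, throw_on_cuda_error can itself be wrapped in a macro, as in the accepted answer. A minimal sketch (the CUDA_TRY name is an illustrative choice, not part of Thrust):

// Hypothetical convenience wrapper around throw_on_cuda_error:
#define CUDA_TRY(expr) throw_on_cuda_error((expr), __FILE__, __LINE__)

// Usage:
//   CUDA_TRY(cudaSetDevice(0));
//   CUDA_TRY(cudaMalloc(&d_ptr, bytes));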
Jared Hoberock
  • The thrust headers seem to have been rearranged. `` is now effectively ``. – chappjc May 18 '15 at 20:05
  • Jared, I think my wrapper library subsumes your suggested solution - mostly, and is lightweight enough to probably be a replacement. (See my answer) – einpoklum Mar 22 '17 at 15:00

The C++-canonical way: Don't check for errors; use the C++ bindings which throw exceptions.

I used to be irked by this problem, and I used to have a macro-cum-wrapper-function solution just like those in talonmies' and Jared's answers. But, honestly? That makes using the CUDA Runtime API even more ugly and C-like.

So I've approached this in a different and more fundamental way. For a sample of the result, here's part of the CUDA vectorAdd sample - with complete error checking of every runtime API call:

// (... prepare host-side buffers here ...)

auto current_device = cuda::device::current::get();
namespace cm = cuda::memory;
auto d_A = cm::device::make_unique<float[]>(current_device, numElements);
auto d_B = cm::device::make_unique<float[]>(current_device, numElements);
auto d_C = cm::device::make_unique<float[]>(current_device, numElements);

cm::copy(d_A.get(), h_A.get(), size);
cm::copy(d_B.get(), h_B.get(), size);

auto launch_config = cuda::launch_config_builder()
    .overall_size(numElements)
    .block_size(256)
    .build();

cuda::launch(vectorAdd, launch_config,
    d_A.get(), d_B.get(), d_C.get(), numElements);    
cm::copy(h_C.get(), d_C.get(), size);

// (... verify results here...)

Again: all potential errors are checked, and an exception is thrown if an error occurred (caveat: if the kernel caused an error after launching successfully, it will only be caught after the attempt to copy back the result, not before; to ensure the kernel succeeded you would need to synchronize the device or the default stream).
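A sketch of that stricter checking, assuming the wrapper's device object exposes a synchronize() member that throws on failure (check the library's documentation for the exact API):

cuda::launch(vectorAdd, launch_config,
    d_A.get(), d_B.get(), d_C.get(), numElements);
// Block until the kernel finishes; an exception is thrown here if the
// kernel itself failed, rather than later during the copy.
current_device.synchronize();
cm::copy(h_C.get(), d_C.get(), size);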

The code above uses my [cuda-api-wrappers](https://github.com/eyalroz/cuda-api-wrappers) library of thin modern-C++ wrappers for the CUDA Runtime API (GitHub).

Note that the exceptions carry both a string explanation and the CUDA runtime API status code after the failing call.

The library's repository includes further examples of how CUDA errors are automagically checked with these wrappers.

einpoklum
  • It looks like the answer can be improved with a scoped C++ namespace. `using namespace cuda::memory`. More than three `::` per line really drives me away from C++ – Dimitri Lesnoff May 16 '23 at 08:52
  • @DimitriLesnoff: How about now? – einpoklum May 16 '23 at 10:36
  • This looks much better! Thanks! Do we have to use `.get()` for each array and why? Can we fetch which device these arrays correspond to and use the same arrays during the rest of the execution? I would also alias this make_unique function into a cuda_make_unique_float function to abstract away all these scope resolutions and template instantiation. `make_unique` will act on arrays, so why do we need to specify the brackets `[]`? The template definition of the function may be improved. I hope I do not sound too harsh on the syntax. – Dimitri Lesnoff May 16 '23 at 11:23
  • @DimitriLesnoff: I believe we're now getting into the territory of the design of my wrappers library - which is out of scope for this page (it is only concerned with error handling). Please consider filing an issue [here](https://github.com/eyalroz/cuda-api-wrappers/issues) or emailing me, or starting a chat session. – einpoklum May 16 '23 at 11:38

The solution discussed here worked well for me. It uses built-in CUDA functions and is very simple to implement.

The relevant code is copied below:

#include <stdio.h>
#include <stdlib.h>

__global__ void foo(int *ptr)
{
  *ptr = 7;
}

int main(void)
{
  foo<<<1,1>>>(0); // deliberately pass a null pointer so the kernel faults

  // make the host block until the device is finished with foo
  cudaDeviceSynchronize();

  // check for error
  cudaError_t error = cudaGetLastError();
  if(error != cudaSuccess)
  {
    // print the CUDA error message and exit
    printf("CUDA error: %s\n", cudaGetErrorString(error));
    exit(-1);
  }

  return 0;
}
Robert Crovella
jthomas

Let me add my favourite macro for CUDA error-checking.

#include <cstdio>
#include <cstdlib>

#define CUDACHECK(err) do { cuda_check((err), __FILE__, __LINE__); } while(false)
inline void cuda_check(cudaError_t error_code, const char *file, int line)
{
    if (error_code != cudaSuccess)
    {
        fprintf(stderr, "CUDA Error %d: %s. In file '%s' on line %d\n", error_code, cudaGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}

Wrap each CUDA function call in the CUDACHECK() macro, and after each kernel launch, use CUDACHECK(cudaPeekAtLastError()).

The do {...} while(false) is there mainly to require a semicolon after the macro invocation and to make the expansion behave as a single statement. See this article for a great explanation of why it is used.
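A quick illustration of why that matters (a sketch; use_gpu, ptr, and bytes are illustrative names):

// With a plain { ... } block macro, the ';' after CUDACHECK(...) would end
// the if-statement early and orphan the 'else'; the do { ... } while(false)
// form keeps the expansion a single statement, so this compiles as intended.
if (use_gpu)
    CUDACHECK(cudaMalloc(&ptr, bytes));
else
    ptr = (int *)malloc(bytes);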

  • While fundamentally no different than the accepted answer, this at least defines the macro correctly using `do { } while`, with an explanation. – Mark Gates Apr 26 '23 at 14:07