From the CUDA C++ Programming Guide (emphasis mine):
> To ensure that any error returned by `cudaPeekAtLastError()` or `cudaGetLastError()` does not originate from calls prior to the kernel launch, one has to make sure that the runtime error variable is set to `cudaSuccess` just before the kernel launch, for example, by calling `cudaGetLastError()` *just before the kernel launch*.
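In code, that advice amounts to something like the following (a minimal sketch; `my_kernel` and its launch configuration are placeholders of mine):

```cpp
#include <stdio.h>

__global__ void my_kernel() {}

int main(void) {
    // Reset the per-thread error variable to cudaSuccess so that the
    // check after the launch cannot pick up a stale, unrelated error.
    cudaGetLastError();

    my_kernel<<<1, 1>>>();

    // Any error seen here stems from this launch (or from an
    // asynchronous error that happened to surface in the meantime).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```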
Furthermore:
> The runtime maintains an error variable for each host thread that is initialized to `cudaSuccess` and is overwritten by the error code every time an error occurs (be it a parameter validation error or an asynchronous error).
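If I read this correctly, the per-thread nature can be demonstrated by provoking a parameter validation error in one host thread and checking the error variable in another (a small sketch of mine; it assumes that 2048 threads per block exceeds the device limit, which holds for current GPUs):

```cpp
#include <stdio.h>
#include <thread>

__global__ void noop_kernel() {}

int main(void) {
    std::thread worker([] {
        // 2048 threads per block is an invalid configuration, so the
        // launch fails with a parameter validation error recorded in
        // *this* thread's error variable.
        noop_kernel<<<1, 2048>>>();
        printf("worker thread: %s\n",
               cudaGetErrorString(cudaGetLastError()));
    });
    worker.join();

    // The main thread has its own error variable, which should still
    // be cudaSuccess despite the failed launch in the worker thread.
    printf("main thread:   %s\n",
           cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```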
I'm not quite sure how to interpret this in the case of multiple runtime instances in the same host thread though. I would think that they don't share the error state, but a citation would be welcome.
According to

> if an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.

there is no guarantee that an asynchronous error will only be reported after synchronization. Otherwise I would expect this part to speak of e.g. a "synchronizing runtime function call".
I don't see how OP's linked sources contradict this. They use synchronization to guarantee that asynchronous errors are reported; they do not say that an asynchronous error could not have been reported before that synchronization.
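The pattern those sources use boils down to something like the following (a minimal sketch; `some_kernel` is a stand-in of mine):

```cpp
#include <stdio.h>

__global__ void some_kernel() {}

int main(void) {
    some_kernel<<<1, 1>>>();

    // Catches errors detected synchronously at launch time
    // (invalid configuration and the like).
    cudaError_t launch_err = cudaGetLastError();
    if (launch_err != cudaSuccess)
        fprintf(stderr, "launch: %s\n", cudaGetErrorString(launch_err));

    // Blocks until the kernel has finished, which guarantees that any
    // asynchronous error has been reported by now; it does not promise
    // that the error could not have surfaced earlier through some
    // unrelated runtime call.
    cudaError_t async_err = cudaDeviceSynchronize();
    if (async_err != cudaSuccess)
        fprintf(stderr, "async: %s\n", cudaGetErrorString(async_err));
    return 0;
}
```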
To investigate further, I performed two small experiments:
```cpp
#include <stdio.h>
#include <unistd.h>

// from https://stackoverflow.com/a/14038590/10107454
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Dereferences a null pointer on the device, i.e. deliberately
// triggers an asynchronous (sticky) error at kernel runtime.
__global__ void broken_kernel() {
    int *ptr{};
    *ptr = 0;
}

int main(void) {
    broken_kernel<<<10, 256>>>();
    gpuErrchk(cudaGetLastError()); // catches synchronous launch errors only
    // gpuErrchk(cudaDeviceSynchronize());
    while (true) {
        usleep(1);
        gpuErrchk(cudaGetLastError());
    }
    return 0;
}
```
This seems to hang (in my particular setup): the loop keeps returning `cudaSuccess` indefinitely. The asynchronous error certainly happens, as can be seen by commenting the synchronization back in. But if we instead add an asynchronous operation that communicates between host and device (`cudaGetLastError()` only seems to check the host-side error variable), the error gets reported:
```cpp
int main(void) {
    int i{42};
    int *ptr{};
    cudaStream_t stream{};
    gpuErrchk(cudaMalloc(&ptr, sizeof(int)));
    gpuErrchk(cudaStreamCreate(&stream));
    broken_kernel<<<10, 256>>>();
    // gpuErrchk(cudaDeviceSynchronize());
    while (true) {
        // The async copy triggers signaling between host and device,
        // through which the sticky error eventually gets reported.
        gpuErrchk(cudaMemcpyAsync(ptr, &i, sizeof(int), cudaMemcpyHostToDevice, stream));
        usleep(1);
    }
    gpuErrchk(cudaFree(ptr)); // unreachable, kept for completeness
    return 0;
}
```
I.e., asynchronous errors do not need synchronous/synchronizing CUDA runtime API calls to be reported. One can also put `broken_kernel` on another stream (instead of the default stream) to make it even clearer that the error can be reported by an "unrelated", asynchronous runtime call (a sketch of this variant follows below). On the other hand, there seems to be no risk that `cudaGetLastError()` will report an asynchronous error that was not already reported through some other API call (like `cudaMemcpyAsync()`) that triggers signaling between host and device.
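For completeness, the variant with the kernel on its own stream could look like this (a sketch derived from the second experiment; the stream names are mine):

```cpp
int main(void) {
    int i{42};
    int *ptr{};
    cudaStream_t kernel_stream{}, copy_stream{};
    gpuErrchk(cudaMalloc(&ptr, sizeof(int)));
    gpuErrchk(cudaStreamCreate(&kernel_stream));
    gpuErrchk(cudaStreamCreate(&copy_stream));

    // The faulting kernel runs on its own non-default stream ...
    broken_kernel<<<10, 256, 0, kernel_stream>>>();

    // ... while the copies are issued on a second, unrelated stream.
    // The sticky asynchronous error should still be reported here.
    while (true) {
        gpuErrchk(cudaMemcpyAsync(ptr, &i, sizeof(int),
                                  cudaMemcpyHostToDevice, copy_stream));
        usleep(1);
    }
    return 0;
}
```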
Feel free to file a bug with Nvidia if you want them to clarify the documentation. The current formulation seems to have been the same since at least CUDA 8.0.