How can I reset the CUDA error to success with Driver API after a trap instruction?

Question

I have a kernel that may execute asm("trap;"). When that happens, the CUDA error code is set to a launch failure, and I cannot reset it.

In the CUDA Runtime API, cudaGetLastError returns the last error and, at the same time, resets it to cudaSuccess.

Is there a way to do that with the Driver API?

Robert Crovella · Accepted Answer · 2019-05-27T17:56:31.977

This type of error cannot be reset with the CUDA Runtime API cudaGetLastError() function.

There are two types of CUDA runtime errors: "sticky" and "non-sticky". Non-sticky errors are those which do not corrupt the context. For example, a cudaMalloc request that asks for more than the available memory will fail, but it will not corrupt the context. Such an error is non-sticky.

Errors that involve unexpected termination of a CUDA kernel (including your trap example, in-kernel assert() failures, and runtime-detected execution errors such as out-of-bounds accesses) are "sticky". You cannot clear sticky errors with cudaGetLastError(). The only method to clear these errors in the runtime API is cudaDeviceReset() (which eliminates all device allocations and wipes out the context).
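To make the distinction concrete, here is a minimal sketch using the runtime API. It assumes a single GPU with far less than 2^60 bytes of memory; trap_kernel and report are hypothetical names chosen for this illustration. The exact error names printed may vary with the CUDA version, so none are claimed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that aborts unconditionally via the PTX trap instruction.
__global__ void trap_kernel() { asm volatile("trap;"); }

// Hypothetical helper: print a label and the name of a runtime error code.
static void report(const char *what, cudaError_t e) {
    printf("%-32s %s\n", what, cudaGetErrorName(e));
}

int main() {
    void *p = nullptr;

    // Non-sticky case: an oversized allocation fails but leaves the context intact.
    report("oversized cudaMalloc:", cudaMalloc(&p, (size_t)1 << 60));
    report("cudaGetLastError (clears it):", cudaGetLastError());
    report("cudaGetLastError again:", cudaGetLastError());

    // Sticky case: the kernel terminates abnormally and the context is corrupted.
    trap_kernel<<<1, 1>>>();
    report("sync after trap:", cudaDeviceSynchronize());
    report("cudaGetLastError:", cudaGetLastError());

    // The error is expected to come back on the next call, because it was not cleared.
    report("small cudaMalloc afterwards:", cudaMalloc(&p, 16));
    return 0;
}
```

The second cudaGetLastError() after the failed allocation should report success, while the small cudaMalloc after the trap should keep reporting the kernel failure.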

The corresponding driver API function is cuDevicePrimaryCtxReset().
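A rough sketch of how that reset might be used follows. As a shortcut, and purely as an assumption of this example, the trapping kernel is launched with runtime `<<<>>>` syntax on the primary context instead of loading a module through the driver API; trap_kernel and name_of are hypothetical names. Build with nvcc and link against the driver library (-lcuda).

```cuda
#include <cstdio>
#include <cuda.h>  // Driver API

// Hypothetical kernel that aborts via the PTX trap instruction.
__global__ void trap_kernel() { asm volatile("trap;"); }

// Hypothetical helper: map a driver API result to its symbolic name.
static const char *name_of(CUresult r) {
    const char *s = nullptr;
    return (cuGetErrorName(r, &s) == CUDA_SUCCESS && s) ? s : "unknown";
}

int main() {
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return 1;
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS) return 1;
    cuCtxSetCurrent(ctx);

    // Runtime-style launch on the primary context (shortcut for this sketch).
    trap_kernel<<<1, 1>>>();
    printf("after trap:      %s\n", name_of(cuCtxSynchronize()));
    printf("second query:    %s\n", name_of(cuCtxSynchronize()));

    // Tear down the corrupted primary context. This destroys every allocation,
    // module, and stream that belonged to it.
    cuCtxSetCurrent(nullptr);
    printf("primary reset:   %s\n", name_of(cuDevicePrimaryCtxReset(dev)));
    cuDevicePrimaryCtxRelease(dev);  // the reset does not drop our retain

    // Retain again to obtain a fresh primary context and query its state.
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS) return 1;
    cuCtxSetCurrent(ctx);
    printf("fresh context:   %s\n", name_of(cuCtxSynchronize()));
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```

Any device pointers or handles obtained before the reset are invalid afterwards and must be recreated. An application that created its own context with cuCtxCreate would destroy and recreate that context instead.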

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

Indeed. So the takeaway/conclusion is that casual usage of `trap` isn't really a good idea. The only time I ever use in-kernel `assert()` which is similar is when I have a catastrophic failure. — Robert Crovella, Apr 27 '17 at 15:00
