From the CUDA C++ Programming Guide (emphasis mine):
> To ensure that any error returned by `cudaPeekAtLastError()` or `cudaGetLastError()` does not originate from calls prior to the kernel launch, one has to make sure that the runtime error variable is set to `cudaSuccess` just before the kernel launch, for example, by calling `cudaGetLastError()` *just before the kernel launch*.
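In code, that advice amounts to something like the following (a minimal sketch; `my_kernel` and its launch configuration are placeholders of mine):

```cpp
#include <stdio.h>

__global__ void my_kernel() {}

int main(void) {
    // Reset the per-thread error variable to cudaSuccess so that the
    // check after the launch cannot pick up a stale, unrelated error.
    cudaGetLastError();

    my_kernel<<<1, 1>>>();

    // Any error seen here stems from this launch (or from an
    // asynchronous error that happened to surface in the meantime).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```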
Furthermore:
> The runtime maintains an error variable for each host thread that is initialized to `cudaSuccess` and is overwritten by the error code every time an error occurs (be it a parameter validation error or an asynchronous error).
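If I read this correctly, the per-thread nature can be demonstrated by provoking a parameter validation error in one host thread and checking the error variable in another (a small sketch of mine; it assumes that 2048 threads per block exceeds the device limit, which holds for current GPUs):

```cpp
#include <stdio.h>
#include <thread>

__global__ void noop_kernel() {}

int main(void) {
    std::thread worker([] {
        // 2048 threads per block is an invalid configuration, so the
        // launch fails with a parameter validation error recorded in
        // *this* thread's error variable.
        noop_kernel<<<1, 2048>>>();
        printf("worker thread: %s\n",
               cudaGetErrorString(cudaGetLastError()));
    });
    worker.join();

    // The main thread has its own error variable, which should still
    // be cudaSuccess despite the failed launch in the worker thread.
    printf("main thread:   %s\n",
           cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```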
I'm not quite sure how to interpret this in the case of multiple runtime instances in the same host thread though. I would think that they don't share the error state, but a citation would be welcome.
According to

> if an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.

there is no guarantee that an asynchronous error will only be reported after synchronization. Otherwise I would expect this part to speak of e.g. a "synchronizing runtime function call".
I don't see how OP's linked sources contradict this. They use synchronization to guarantee that asynchronous errors are reported; they do not say that an asynchronous error could not have been reported before that synchronization.
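The pattern those sources use boils down to something like the following (a minimal sketch; `some_kernel` is a stand-in of mine):

```cpp
#include <stdio.h>

__global__ void some_kernel() {}

int main(void) {
    some_kernel<<<1, 1>>>();

    // Catches errors detected synchronously at launch time
    // (invalid configuration and the like).
    cudaError_t launch_err = cudaGetLastError();
    if (launch_err != cudaSuccess)
        fprintf(stderr, "launch: %s\n", cudaGetErrorString(launch_err));

    // Blocks until the kernel has finished, which guarantees that any
    // asynchronous error has been reported by now; it does not promise
    // that the error could not have surfaced earlier through some
    // unrelated runtime call.
    cudaError_t async_err = cudaDeviceSynchronize();
    if (async_err != cudaSuccess)
        fprintf(stderr, "async: %s\n", cudaGetErrorString(async_err));
    return 0;
}
```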
To investigate further, I performed two small experiments:
```cpp
#include <stdio.h>
#include <unistd.h>

// from https://stackoverflow.com/a/14038590/10107454
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Dereferences a null pointer on the device, i.e. deliberately
// triggers an asynchronous (sticky) error at kernel runtime.
__global__ void broken_kernel() {
    int *ptr{};
    *ptr = 0;
}

int main(void) {
    broken_kernel<<<10, 256>>>();
    gpuErrchk(cudaGetLastError()); // catches synchronous launch errors only
    // gpuErrchk(cudaDeviceSynchronize());
    while (true) {
        usleep(1);
        gpuErrchk(cudaGetLastError());
    }
    return 0;
}
```
This seems to hang (in my particular setup): the loop keeps returning `cudaSuccess` indefinitely. The asynchronous error certainly happens, as can be seen by commenting the synchronization back in. But if we instead add an asynchronous operation that communicates between host and device (`cudaGetLastError()` only seems to check the host-side error variable), the error gets reported:
```cpp
int main(void) {
    int i{42};
    int *ptr{};
    cudaStream_t stream{};
    gpuErrchk(cudaMalloc(&ptr, sizeof(int)));
    gpuErrchk(cudaStreamCreate(&stream));
    broken_kernel<<<10, 256>>>();
    // gpuErrchk(cudaDeviceSynchronize());
    while (true) {
        // The async copy triggers signaling between host and device,
        // through which the sticky error eventually gets reported.
        gpuErrchk(cudaMemcpyAsync(ptr, &i, sizeof(int), cudaMemcpyHostToDevice, stream));
        usleep(1);
    }
    gpuErrchk(cudaFree(ptr)); // unreachable, kept for completeness
    return 0;
}
```
I.e., asynchronous errors do not need synchronous/synchronizing CUDA runtime API calls to be reported. One can also put `broken_kernel` on another stream (instead of the default stream) to make it even clearer that the error can be reported by an "unrelated", asynchronous runtime call (a sketch of this variant follows below). On the other hand, there seems to be no risk that `cudaGetLastError()` will report an asynchronous error that was not already reported through some other API call (like `cudaMemcpyAsync()`) that triggers signaling between host and device.
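For completeness, the variant with the kernel on its own stream could look like this (a sketch derived from the second experiment; the stream names are mine):

```cpp
int main(void) {
    int i{42};
    int *ptr{};
    cudaStream_t kernel_stream{}, copy_stream{};
    gpuErrchk(cudaMalloc(&ptr, sizeof(int)));
    gpuErrchk(cudaStreamCreate(&kernel_stream));
    gpuErrchk(cudaStreamCreate(&copy_stream));

    // The faulting kernel runs on its own non-default stream ...
    broken_kernel<<<10, 256, 0, kernel_stream>>>();

    // ... while the copies are issued on a second, unrelated stream.
    // The sticky asynchronous error should still be reported here.
    while (true) {
        gpuErrchk(cudaMemcpyAsync(ptr, &i, sizeof(int),
                                  cudaMemcpyHostToDevice, copy_stream));
        usleep(1);
    }
    return 0;
}
```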
Feel free to file a bug with Nvidia if you want them to clarify the documentation. The current formulation seems to have been the same since at least CUDA 8.0.