CUDA: how to reset the GPU after a "sticky" error?

Question

I have a multithreaded program in which only one thread works with the GPU (CUDA, C++). How can I resume normal GPU processing after a "sticky" error in CUDA code on Linux?

I tried cudaDeviceReset():

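// Empty test kernel, used only to check whether the GPU is usable again.
__global__ void tKernel() { return; }

// This is part of my code:
cudaDeviceSynchronize();
cudaGetLastError();  // read the pending error (and try to clear it)
cudaGetLastError();
cudaDeviceReset();   // attempt to reset the device
tKernel<<<1, 1>>>();
cudaDeviceSynchronize();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::cout << cudaGetErrorString(err) << std::endl;
}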

But every time I get the error "all CUDA-capable devices are busy or unavailable". The only way out is to stop my program and run it again. How can I reset the GPU without restarting the program?
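Since restarting the program is the one thing reported above to work, one possible workaround is to automate that restart: keep all GPU work in a child process and let a small supervisor start a fresh child when one fails. The following is a minimal sketch under stated assumptions, not a confirmed fix. The names run_isolated, run_with_restart and work are hypothetical, the GPU code is assumed to live entirely inside work(), and the parent is assumed to make no CUDA calls itself before forking.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <functional>

// Runs `work` in a forked child and returns its exit status (0 = success).
// Returns -1 if fork/waitpid fails or the child was killed by a signal.
// Assumption: every CUDA call happens inside work(), never in the parent.
int run_isolated(const std::function<int()>& work) {
    std::fflush(nullptr);  // flush stdio so the child does not inherit pending output
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: a broken GPU context, if any, disappears when this process exits.
        // _exit skips stdio flushing, so work() should flush its own output.
        _exit(work());
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    // Only the low 8 bits of the child's return value survive as exit status.
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

// Retries `work` in a fresh process, up to max_attempts times.
// `work` receives the 1-based attempt number.
// Returns the number of attempts used, or -1 if every attempt failed.
int run_with_restart(const std::function<int(int)>& work, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        int rc = run_isolated([&] { return work(attempt); });
        if (rc == 0) return attempt;
        std::fprintf(stderr, "attempt %d failed (rc=%d), restarting worker\n",
                     attempt, rc);
    }
    return -1;
}
```

In this sketch, work() would return nonzero whenever a CUDA call reports an error, for example after the final cudaGetLastError() check in the snippet above. Results would then have to travel back to the parent through a pipe, a file or shared memory, which is the main cost of this design compared with an in-process reset.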

But your code should have no errors. Can you not avoid that? — Ander Biguri, Apr 30 '19 at 10:41
One of my kernels returns the error "an illegal memory access". After that I try to resume normal GPU operation, and this code is how I do it. The initial "an illegal memory access" error is very difficult to reproduce (since many different kernels process input data that may be incorrect). So my question is: if there is a "sticky" error before this code runs, how do I reset the GPU? — Harry, Apr 30 '19 at 15:03
Yes, I get your question; my comment addresses your approach to solving it. You can wrap all of your code in a big `try catch`, but that's not what people do. You write code that does not error. In CUDA that should be your objective: don't let CUDA fail and recover, just make sure it does not fail. Yeah, debugging is hard, but it's the job. — Ander Biguri, Apr 30 '19 at 15:05
I will try to figure out how to change the CUDA code to avoid the failure, but that may take a very long time. Additional checks in CUDA code may also degrade performance. In some cases it may be better to recover CUDA after a failure rather than to do additional checks in kernels. — Harry, Apr 30 '19 at 15:26
I did not say inside kernels. However, I am quite sure not erroring will be faster than erroring, in any case. — Ander Biguri, Apr 30 '19 at 15:34

0 Answers