
When training either of two different neural networks, one with TensorFlow and the other with Theano, the execution sometimes freezes after a random amount of time (anywhere from a few minutes to a few hours, but mostly a few hours). Running "nvidia-smi" then gives this message:

"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"

I monitored the GPU during a 13-hour run, and everything seemed stable.
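For reference, this kind of monitoring can be done by polling "nvidia-smi" at a fixed interval and logging each reading. The following is only a minimal sketch, not the script used for the 13-hour run: the selection of fields, the 5-second interval, and the helper names `parse_sample` and `poll` are all assumptions made for illustration.

```python
import subprocess
import time

# Fields requested from nvidia-smi. This selection is an assumption;
# `nvidia-smi --help-query-gpu` lists what is available.
FIELDS = ["timestamp", "index", "temperature.gpu", "utilization.gpu",
          "power.draw", "memory.used"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def poll(interval_s=5.0):
    """Print one reading per GPU every interval_s seconds (hypothetical helper)."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    while True:
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        if result.returncode != 0:
            # nvidia-smi exited with an error: print its message and stop.
            print("nvidia-smi failed:", result.stdout.strip())
            break
        for line in result.stdout.strip().splitlines():
            print(parse_sample(line))
        time.sleep(interval_s)
```

Calling `poll()` and redirecting the output to a file leaves a record of the last readings before a freeze, which makes it easier to check whether temperature, power draw, or memory usage changed shortly beforehand.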

I'm working with:

  • Ubuntu 14.04.5 LTS
  • GPUs: NVIDIA Titan Xp (the same behavior occurs on another GPU in the same machine)
  • CUDA 8.0
  • cuDNN 5.1
  • TensorFlow 1.3
  • Theano 0.8.2

I'm not sure how to approach this problem. Can anyone suggest what might cause it and how to diagnose or fix it?

Mega

1 Answer


I posted this question a while ago. After an investigation that took a few weeks, we managed to find the problem (and a solution). I don't remember all the details now, but I'm posting our main conclusion in case someone finds it useful.

The bottom line: the hardware we had was not strong enough to support heavy GPU-CPU communication. We observed these issues on a rack server with 1 CPU and 4 GPUs; the PCI bus was simply overloaded. The problem was solved by adding another CPU to the rack server.

Mega
  • Thank you for the answer! Do you remember how you determined that this was due to an overload on the PCI bus? – A. Attia Feb 20 '19 at 08:00
  • We tried to characterize when these failures happen in terms of the code we were running. We found they occur either when we use 3-4 GPUs in parallel or when running code that causes a lot of CPU-GPU traffic. Then we compared our server spec to commonly used specs and saw that usually there are two CPUs while we had just one. So we bought another one, and the problem was solved. – Mega Feb 20 '19 at 11:17
  • I also remember we looked a lot at the server's system logs and saw many warnings/errors from the PCI bus. Sorry for the lack of details; I didn't document our investigation process. – Mega Feb 20 '19 at 11:22
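
For anyone who wants to repeat the log check described in the comments, here is a minimal sketch of filtering a kernel log for PCI-related and NVIDIA driver messages. It is not the procedure we actually followed, which I didn't document. The list of patterns, the default log path, and the function names `find_pci_errors` and `scan_file` are assumptions for illustration; the messages on your system may look different.

```python
import re

# Substrings that commonly mark NVIDIA driver or PCIe errors in Linux kernel
# logs. This list is an assumption, not the set of messages seen on our server.
PATTERNS = re.compile(r"NVRM: Xid|PCIe Bus Error|AER:|fallen off the bus",
                      re.IGNORECASE)


def find_pci_errors(lines):
    """Return the log lines that match any of the PATTERNS."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


def scan_file(path="/var/log/kern.log"):
    """Scan a kernel log file; the default path is an assumption (typical on Ubuntu)."""
    with open(path, errors="replace") as f:
        return find_pci_errors(f)
```

Running `scan_file()` after a freeze and comparing the timestamps of the matching lines with the time the training stopped can help show whether bus errors precede the "GPU is lost" state. Reading the log may require root privileges, depending on the system.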