
I developed a model in Keras and trained it quite a few times. At one point I forcefully stopped a training run, and since then I have been getting the following error:

Traceback (most recent call last):
  File "inception_resnet.py", line 246, in <module>
    callbacks=[checkpoint, saveEpochNumber])   ##
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
    session = get_session()
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

So the error is actually

tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Most probably the GPU memory is still occupied; I can't even create a simple TensorFlow session.
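Even a minimal snippet that does nothing but open a session reproduces the error:

import tensorflow as tf

sess = tf.Session()   # fails with InternalError: Failed to create session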

I have seen an answer here, but when I execute the following command in the terminal,

export CUDA_VISIBLE_DEVICES=''

training starts, but without GPU acceleration.

Also, I am training the model on a server where I have no root access, so I can't restart the machine or clear the GPU memory as root. What is the solution?


3 Answers


I found the solution in a comment on this question.

nvidia-smi -q

This lists all the processes (and their PIDs) occupying GPU memory. I killed them one by one with

kill -9 PID

Now everything is running smoothly again.
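If there are many such processes, the lookup-and-kill step can be scripted. A rough Python sketch, assuming nvidia-smi is on the PATH and the listed processes belong to you:

import os
import signal
import subprocess

# Ask nvidia-smi for the PIDs of all compute processes using the GPU.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"])

for line in out.decode().splitlines():
    if not line.strip():
        continue
    pid = int(line.strip())
    if pid != os.getpid():             # never kill the current process
        os.kill(pid, signal.SIGKILL)   # same effect as kill -9 PID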

  • I got the same error while my GPU usage is all zero; how do you find the processes taking GPU memory in the output of `nvidia-smi -q`? – K.Wanter May 26 '18 at 12:33
  • Find the "Processes" section in the output; that is where you will find the process IDs. @K.Wanter – Preetom Saha Arko May 27 '18 at 05:21
  • Thank you; it turns out my problem was caused by a driver version that was insufficient for the CUDA runtime version. Updating the driver solved the problem. – K.Wanter May 27 '18 at 15:15
  • The "Processes" section is blank on my system, and nvidia-smi shows 0% GPU utilization. Is there any other way around this without restarting the server? – Fitzerbirth Jan 21 '20 at 07:25

I am using Anaconda 4.5.12 with Python 3.5 and NVIDIA driver 390.116, and I faced the same issue. In my case it was caused by an incompatible cudatoolkit version:

conda install tensorflow-gpu

installed cudatoolkit 9.3.0 with cudnn 7.3.x. However, after going through the answers here and checking my other virtual environment, where I use PyTorch with the GPU without any problem, I inferred that cudatoolkit 9.0.0 would be compatible with my driver version.

conda install cudatoolkit==9.0.0

This installed cudatoolkit 9.0.0 and cudnn 7.3.0 from the cuda 9.0_0 build. After this I was able to create a TensorFlow session with the GPU.
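After the downgrade, a quick sanity check (TF 1.x API) to confirm that TensorFlow can see the GPU again:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())   # should now include a GPU device entry
print(tf.test.is_gpu_available())        # True once driver and cudatoolkit match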

Now, coming to the options for killing jobs:

  • If GPU memory is occupied by other jobs, killing them one by one as suggested by @Preetom Saha Arko will free the GPU and may allow you to create a tf session with the GPU (provided the compatibility issues are already resolved).
  • To create a session on a specific GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi, and set the CUDA visible device to an available GPU ID (0 in this example):

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = '0'

    Then tf.Session() can create a session on the specified GPU device (a combined sketch follows this list).

  • Otherwise, if nothing works with the GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi and set the CUDA visible device to empty:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ''

    Then tf.Session() can create a session on the CPU.
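Putting the two options together, a rough sketch of the GPU case (assuming GPU ID 0; note that the environment variable has to be set before TensorFlow is imported):

import os

# Must be set before TensorFlow is imported, otherwise it is ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # use "" to force the CPU instead

import tensorflow as tf

sess = tf.Session()   # should now be created on the selected device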


I had a similar problem while working on a cluster. When I submitted the job script to the Slurm server it would run fine, but while training the model in a Jupyter notebook I would get the following error:

InternalError: Failed to create session

Reason: I was running multiple Jupyter notebooks under the same GPU (all of them using TensorFlow), so the Slurm server would not allow a new TensorFlow session to be created. The problem was solved by stopping all the Jupyter notebooks and then running only one or two at a time.

Below is the error from the Jupyter notebook log:

Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12786073600
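Unrelated to Slurm itself, but if several notebooks really do have to share one GPU, a possible workaround (TF 1.x API, not part of the fix described above) is to let each session allocate GPU memory on demand instead of claiming the whole card up front:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grow GPU memory usage as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.3   # or cap each process

sess = tf.Session(config=config)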