
I tried running the following script on a server whose GPUs' VRAM was more than 96% used:

import tensorflow as tf

a = tf.constant(1, name='a')
b = tf.constant(3, name='b')
c = tf.constant(9, name='c')
d = tf.add(a, b, name='d')
e = tf.add(d, c, name='e')

session_conf = tf.ConfigProto(
    device_count={'CPU': 1, 'GPU': 0},
    allow_soft_placement=True
)
sess = tf.Session(config=session_conf)
print(sess.run([d, e]))

It gave me a CUDA_ERROR_OUT_OF_MEMORY error that stopped the execution of the program:

joe@doe:/scratch/test$ python3.5 shape.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
Traceback (most recent call last):
  File "shape.py", line 20, in <module>
    sess = tf.Session(config=session_conf)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1187, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 552, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Why does the level of VRAM usage interfere with my program, given that I specified device_count={'CPU': 1, 'GPU': 0}, allow_soft_placement=True when creating the TensorFlow session?

Franck Dernoncourt

1 Answer


I'm not sure that device_count={'GPU': 0} prevents GPU memory allocation; I haven't seen it used that way before. There's a chance it doesn't work because the GPU allocator is a process-level concept (it is shared between sessions), so you are trying to configure a process-level setting through a session-level config. The surest way is to make the GPUs invisible at the process level by setting an environment variable: export CUDA_VISIBLE_DEVICES=
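For illustration, here is a minimal sketch of the same idea applied from inside the script itself (assuming you would rather not change how the job is launched): set the variable with os.environ before TensorFlow initializes its CUDA context, i.e. before the session is created, and safest before the import:

import os

# Hide all GPUs from CUDA for this process; this must happen before
# TensorFlow initializes a CUDA context (i.e. before the first Session).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf

a = tf.constant(1, name='a')
b = tf.constant(3, name='b')
c = tf.constant(9, name='c')
d = tf.add(a, b, name='d')
e = tf.add(d, c, name='e')

sess = tf.Session()          # no GPU is visible, so nothing is allocated on it
print(sess.run([d, e]))      # -> [4, 13], computed on the CPU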

Yaroslav Bulatov
  • Please note that on Windows `export CUDA_VISIBLE_DEVICES=` will not work (as I found out the hard way [here](https://stackoverflow.com/questions/44500733/tensorflow-allocating-gpu-memory-when-using-tf-device-cpu0/44513295?noredirect=1#comment76027592_44513295)). To effectively mask all GPUs you must set `CUDA_VISIBLE_DEVICES=-1` (or any other invalid device number) – GPhilo Jun 13 '17 at 13:37
  • It works for me, and it's pretty widely used; something must be special about the CUDA driver in your case. – Yaroslav Bulatov Jun 13 '17 at 15:24
  • Which CUDA SDK are you using? I'm using version 8, and the documentation doesn't specify the behaviour of the empty string – GPhilo Jun 13 '17 at 15:29
  • Also using 8. Both on Linux and MacOS. If it's not documented, it could be legacy behavior from earlier versions – Yaroslav Bulatov Jun 13 '17 at 15:36
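
Following up on the Windows caveat in the comments above, a sketch of the same in-script approach using an invalid device ordinal (-1), which masks the GPUs consistently across platforms:

import os

# -1 is not a valid CUDA device ordinal, so no GPU is exposed to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

sess = tf.Session()  # CPU-only; no CUDA context is retained on any GPU
print(sess.run(tf.add(tf.constant(1), tf.constant(3))))  # -> 4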