
I am trying to run a TensorFlow project and I am running into memory problems on the university HPC cluster. I have to run a prediction job for hundreds of inputs of differing lengths. We have GPU nodes with different amounts of GPU memory, so I am trying to set up the scripts in a way that will not crash for any combination of GPU node and input length.

After searching the net for solutions, I experimented with TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE, and TF_FORCE_GPU_ALLOW_GROWTH, and also with TensorFlow's set_memory_growth. As I understand it, with unified memory I should be able to use more memory than the GPU physically has.

This was my final solution (only the relevant parts):

import os

os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
#os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # as I understood, this is redundant with the set_memory_growth call below :)

import tensorflow as tf    
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      print(gpu)
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

and I submit it to the cluster (Slurm job scheduler) with --mem=30G and --gres=gpu:1.
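Since the model itself runs on JAX (as the traceback below shows), my understanding is that these variables are only picked up if they are set before jax initializes its GPU backend, which is why I set them at the very top of the script. A minimal sketch of that ordering (the jax.devices() call is only there to force backend initialization):

import os

# These are read when JAX's XLA GPU client is first initialized, so they are set
# before importing jax (or any library that imports it, such as alphafold).
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'           # back the allocator with CUDA unified memory
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'  # allow allocating up to 2x the GPU's memory

import jax
print(jax.devices())  # the GPU backend initializes here and picks up the settings above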

And this is the error my code crashes with. As I understand it, it does try to use unified memory but fails for some reason.

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.


Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
    donated_invars=donated_invars, inline=inline)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
    return call_bind(self, fun, *args, **params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
    return trace.process_call(self, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
    return primitive.impl(f, *tracers, **params)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
    *unsafe_map(arg_spec, args))
  File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
    ans = call(fun, *args)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
    compiled = compile_or_get_cached(backend, built, options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
    return backend_compile(backend, computation, compile_options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

I would be glad for any ideas on how to get it to work and use all the available memory.

Thank you!

  • OOM is not a programming error. I think that before starting the training you should first compute how much VRAM would be consumed for the given batch size and adjust accordingly. You can also try the gradient accumulation technique, as well as mixed-precision training. – Innat Aug 27 '21 at 14:55
  • Thank you for the answer, M.Innat. I am not training but predicting with a model; however, there is an option to turn on a training feature, dropout, and that is when the OOM occurs. – aqua Aug 28 '21 at 16:14

2 Answers


It looks like your GPU doesn't fully support unified memory. On such GPUs the support is limited: in practice, the GPU holds all data in its own memory.

See this article for the description: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

In particular:

On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.

And:

Since these older GPUs can’t page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won’t).

And your GPU is Kepler-based, according to TechPowerUp: https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-black.c2549

As far as I know, TensorFlow should also issue a warning about that. Something like:

Unified memory on GPUs with compute capability lower than 6.0 (pre-Pascal class GPUs) does not support oversubscription.
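If it helps, you can confirm this at runtime by checking the compute capability that TensorFlow reports for the device. A rough sketch (the 6.0 threshold is taken from the article quoted above):

import tensorflow as tf

# Check whether the visible GPU is Pascal (compute capability 6.0) or newer,
# i.e. whether unified-memory oversubscription can work at all.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    details = tf.config.experimental.get_device_details(gpus[0])
    cc = details.get('compute_capability')  # e.g. (3, 5) for a GTX TITAN Black
    print(details.get('device_name'), cc)
    if cc is None or cc < (6, 0):
        print('Pre-Pascal GPU: everything must fit in device memory; '
              'unified memory cannot be oversubscribed.')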

dm0_

This answer will probably be useful for you. The nvidia_smi Python module has some useful tools, such as checking the GPU's total memory. Here I reproduce the code from the answer I mentioned earlier.

import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

print("Total memory:", info.total)

nvidia_smi.nvmlShutdown()

I think this should be your starting point. A simple solution would be to set the batch size according to the GPU memory. If you only want to get predictions, then apart from the batch_size there is usually nothing else that is particularly memory intensive. Also, if any preprocessing is done on the GPU, I would recommend moving it to the CPU.
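As a rough sketch of both ideas, you could derive the batch size from the total memory reported above and keep any remaining preprocessing on the CPU (the bytes-per-sample figure below is just a placeholder you would need to measure for your own model):

import nvidia_smi
import tensorflow as tf

BYTES_PER_SAMPLE = 50 * 1024 * 1024  # placeholder; measure this for your model/inputs

# Read the total GPU memory, as in the snippet above
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
total_bytes = nvidia_smi.nvmlDeviceGetMemoryInfo(handle).total
nvidia_smi.nvmlShutdown()

# Leave some headroom and derive a batch size from it
batch_size = max(1, int(0.8 * total_bytes) // BYTES_PER_SAMPLE)
print('Using batch size:', batch_size)

# Anything placed under this context stays off the GPU
with tf.device('/CPU:0'):
    features = tf.random.uniform((batch_size, 128))  # stand-in for real preprocessing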

Georgios Livanos
  • Thank you for the answer! Unfortunately, I have already moved everything to the CPU that I could. Moreover, in the future I might need to run this on bigger inputs, so I do not want to tailor it to my current set. As far as I understand, unifying the memory should work in these cases, but for some reason it cannot allocate. I cannot find any solution on the internet for this specific 'failed to alloc 12758286336 bytes unified memory;' error. – aqua Aug 31 '21 at 05:16