I am trying to run a TensorFlow project on our university HPC cluster and am running into GPU memory problems. I need to run a prediction job for hundreds of inputs of differing lengths. Our GPU nodes have different amounts of GPU memory (VRAM), so I am trying to set up the scripts in a way that will not crash for any combination of GPU node and input length.
After searching the net for solutions, I experimented with TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE, and TF_FORCE_GPU_ALLOW_GROWTH, as well as with TensorFlow's set_memory_growth. As I understand it, with unified memory I should be able to use more memory than the GPU physically has, with the excess spilling to host RAM.
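To check that assumption in isolation, this is roughly the standalone test I have in mind (a minimal sketch of my own, not part of the actual pipeline; the ~8 GiB array size is arbitrary, picked to exceed the 6 GB cards, and it uses jax.numpy because the traceback below shows the allocation happening inside JAX/XLA):

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=2.0
# Allocate ~8 GiB of float32 on a ~6 GB card; with unified memory this should
# succeed by spilling the excess to host RAM (slower, but no OOM).
python -c "import jax.numpy as jnp; x = jnp.ones((2 * 1024**3,), jnp.float32); print(x.nbytes / 1024**3, 'GiB allocated')"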
This is my final setup (only the relevant parts):
import os

# These have to be set before tensorflow/jax is imported.
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
# os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # as I understand it, redundant with the set_memory_growth call below :)
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print(gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
I submit it to the cluster (Slurm job scheduler) with --mem=30G and --gres=gpu:1.
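For reference, the submission script looks roughly like this (a minimal sketch; module loads and environment activation are omitted):

#!/bin/bash
#SBATCH --mem=30G      # host RAM for the job
#SBATCH --gres=gpu:1   # one GPU, whatever type the assigned node has
python scripts/script.py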
And this is the error my code crashes with. As far as I can tell, it does try to use unified memory, but the allocation fails for some reason:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
donated_invars=donated_invars, inline=inline)
File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
return call_bind(self, fun, *args, **params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
return trace.process_call(self, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
return primitive.impl(f, *tracers, **params)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
*unsafe_map(arg_spec, args))
File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
ans = call(fun, *args)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
compiled = compile_or_get_cached(backend, built, options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
return backend_compile(backend, computation, compile_options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
I would be glad for any ideas on how to get this to work and use all the available memory.
Thank you!