TensorFlow crashing when trying to train model

Question

I'm trying to train a model in TensorFlow. My code worked fine but then suddenly started crashing at the training phase. I've tried multiple "fixes", from copying CUDA .dll files to inserting the following code after my imports, but to no avail.

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Here's the error that pops up while the model is being compiled:

tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory 2021-12-26 10:53:00.265328:

The model architecture:

Model: "sequential"
_______________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
resizing (Resizing)          (None, 128, 128, 1)       0
_________________________________________________________________
normalization (Normalization (None, 128, 128, 1)       3
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 32)      320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 64, 64, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 64, 64, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 32, 32, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 65536)             0
_________________________________________________________________
dense (Dense)                (None, 216)               14155992
_________________________________________________________________
dropout_1 (Dropout)          (None, 216)               0
_________________________________________________________________
dense_1 (Dense)              (None, 36)                7812
=================================================================
Total params: 14,182,623
Trainable params: 14,182,620
Non-trainable params: 3
_______________________________________________________________

And the error that occurs when training begins (I've cropped out the repeated "failed to allocate memory" logs):

2021-12-26 10:54:08.890289: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-12-26 10:54:08.891029: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-12-26 10:54:08.899859: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-12-26 10:54:08.933109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-12-26 10:54:08.947342: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-12-26 10:54:08.948462: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-12-26 10:54:08.956260: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-12-26 10:54:08.958977: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
Epoch 1/50
2021-12-26 10:54:11.849166: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2021-12-26 10:54:13.674500: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 144.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-12-26 10:54:13.674920: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.

tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats:
Limit:                       519487488
InUse:                       515392000
MaxInUse:                    515465728
NumAllocs:                       23263
MaxAllocSize:                134217728
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2021-12-26 10:54:29.147248: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ***********xxxxxxxxxx**********************************************************************xxxxxxxxx
2021-12-26 10:54:29.151731: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at pooling_ops_common.cc:225 : Resource exhausted: OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "train.py", line 196, in <module>
    callbacks=[tf.keras.callbacks.ModelCheckpoint("models/net7_e50", monitor="val_loss", verbose=1, save_freq="epoch"), tf.keras.callbacks.TensorBoard("./logs/net7e50")],
  File "<Project_Directory>\venv\lib\site-packages\keras\engine\training.py", line 1184, in fit
    tmp_logs = self.train_function(iterator)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
    ctx=ctx)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node sequential/max_pooling2d_1/MaxPool (defined at train.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_1432]

Function call stack:
train_function

Any help would be much appreciated!

As you are running out of memory, have you tried this: If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. — JJ., Dec 26 '21 at 19:48
@JJ I get the following error when trying to set TF_GPU_ALLOCATOR: ```NameError: name 'cuda_malloc_async' is not defined``` — Hammaad, Dec 26 '21 at 20:07
TF_GPU_ALLOCATOR is an environment variable, not Python code. — Dr. Snoopy, Dec 26 '21 at 20:11
And importantly, what model exactly are you training and on which GPU? — Dr. Snoopy, Dec 26 '21 at 20:13
Use [dotenv](https://stackoverflow.com/a/41547163/13871725) to set environment variables and access them. — JJ., Dec 26 '21 at 20:13
I set ```TF_GPU_ALLOCATOR=cuda_malloc_async``` in a .env file but get the same error. — Hammaad, Dec 26 '21 at 20:38
And you did not answer my question, what GPU are you using and how much RAM does it have? — Dr. Snoopy, Dec 26 '21 at 23:44
1 GB GPU RAM is very little, your code is trying to allocate 4 GB of GPU RAM, this is not a problem that you will solve with some DLLs or environmental variables, you need to make your code use significantly less GPU RAM. — Dr. Snoopy, Dec 27 '21 at 00:53
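
For reference, the NameError mentioned in the comments above comes from writing the variable as bare Python code rather than setting it in the process environment. A minimal sketch of setting it from inside a script follows. Setting it before TensorFlow is imported is an assumption made here to be safe, and the import is left commented out so the snippet runs without TensorFlow installed. As the last comment points out, this is unlikely to help when the GPU simply has too little memory.

```python
import os

# The allocator name is a string value in the environment, not a Python identifier.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# Import TensorFlow only after the variable is set, so the runtime can see it
# when it initializes the GPU (assumption: set-before-import is the safe order).
# import tensorflow as tf

print(os.environ["TF_GPU_ALLOCATOR"])
```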

score 0 · Answer 1 · answered Jan 10 '22 at 12:10

(Paraphrased from Dr. Snoopy's comment)

"1 GB GPU RAM is very little, your code is trying to allocate 4 GB of GPU RAM, this is not a problem that you will solve with some DLLs or environmental variables, you need to make your code use significantly less GPU RAM."

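To make "significantly less GPU RAM" concrete, here is a rough back-of-the-envelope sketch using only numbers from the question: the parameter count and output shapes from the model summary, and the allocator "Limit" from the log. Two things are assumptions rather than facts from the post: the batch size of 64 is inferred from the first dimension of the OOM tensor shape [64,64,32,32], and the optimizer is assumed to be Adam-like (two extra slots per parameter). The estimate is a lower bound, since it ignores activation gradients, cuDNN workspaces, and fragmentation.

```python
BYTES_PER_FLOAT32 = 4
ALLOCATOR_LIMIT = 519_487_488   # "Limit" from the bfc_allocator stats in the log
TOTAL_PARAMS = 14_182_623       # "Total params" from the model summary

# Per-sample output element counts, taken from the summary's Output Shape column.
ACTIVATIONS_PER_SAMPLE = {
    "resizing": 128 * 128 * 1,
    "normalization": 128 * 128 * 1,
    "conv2d": 128 * 128 * 32,
    "max_pooling2d": 64 * 64 * 32,
    "conv2d_1": 64 * 64 * 64,
    "max_pooling2d_1": 32 * 32 * 64,
    "dropout": 32 * 32 * 64,
    "flatten": 65536,
    "dense": 216,
    "dropout_1": 216,
    "dense_1": 36,
}


def estimate_bytes(batch_size, optimizer_slots=2):
    """Return (weight_bytes, activation_bytes) for one training step.

    Counts one copy each for weights and gradients, plus `optimizer_slots`
    copies for optimizer state (2 is an assumed Adam-like optimizer).
    """
    weight_bytes = TOTAL_PARAMS * BYTES_PER_FLOAT32 * (2 + optimizer_slots)
    per_sample = sum(ACTIVATIONS_PER_SAMPLE.values())
    activation_bytes = per_sample * batch_size * BYTES_PER_FLOAT32
    return weight_bytes, activation_bytes


# Compare a few batch sizes against the allocator limit reported in the log.
for batch_size in (64, 32, 16, 8):
    weights, activations = estimate_bytes(batch_size)
    total = weights + activations
    verdict = "over" if total > ALLOCATOR_LIMIT else "under"
    print(f"batch={batch_size:3d}  weights+optimizer={weights / 2**20:6.1f} MiB  "
          f"activations={activations / 2**20:6.1f} MiB  ({verdict} the limit)")
```

Under these assumptions, the two levers are the batch size, which scales the activation term, and the width of the first dense layer, which holds 14,155,992 of the 14,182,623 parameters because it is fed by the 65,536-element flatten output.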