I'm trying to train a model in TensorFlow. My code worked fine, but it suddenly started crashing during the training phase. I've tried multiple "fixes", from copying CUDA .dll files around to inserting the following code right after my imports, but to no avail:
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
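For completeness, here is that snippet in context. The import and the emptiness check are only there to make the example self-contained, and the commented-out memory cap is an alternative I've come across, not something in my actual script:

import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    # What I actually inserted: let TensorFlow allocate GPU memory on demand
    # instead of reserving (nearly) all of it up front.
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

    # Variant I've seen suggested (NOT in my script): hard-cap the memory
    # TensorFlow may use on the first GPU, e.g. 2 GB.
    # tf.config.set_logical_device_configuration(
    #     physical_devices[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=2048)])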
Here's the error that pops up while the model is being compiled:
tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
The model architecture:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
resizing (Resizing)          (None, 128, 128, 1)       0
_________________________________________________________________
normalization (Normalization (None, 128, 128, 1)       3
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 32)      320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 64, 64, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 64, 64, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 32, 32, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 65536)              0
_________________________________________________________________
dense (Dense)                (None, 216)               14155992
_________________________________________________________________
dropout_1 (Dropout)          (None, 216)               0
_________________________________________________________________
dense_1 (Dense)              (None, 36)                7812
=================================================================
Total params: 14,182,623
Trainable params: 14,182,620
Non-trainable params: 3
_________________________________________________________________
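In case code is easier to read than the summary table, the model is built roughly like this. The kernel and pool sizes follow from the parameter counts and output shapes above; the activations, dropout rates and input shape here are placeholders rather than my exact values:

from tensorflow.keras import layers, models

model = models.Sequential([
    # On older TF versions these preprocessing layers live under
    # tf.keras.layers.experimental.preprocessing instead.
    layers.Resizing(128, 128, input_shape=(None, None, 1)),
    layers.Normalization(),                                    # 3 non-trainable params (mean, variance, count)
    layers.Conv2D(32, 3, padding='same', activation='relu'),   # 3*3*1*32 + 32 = 320
    layers.MaxPooling2D(),                                     # 128x128 -> 64x64
    layers.Conv2D(64, 3, padding='same', activation='relu'),   # 3*3*32*64 + 64 = 18,496
    layers.MaxPooling2D(),                                     # 64x64 -> 32x32
    layers.Dropout(0.25),
    layers.Flatten(),                                          # 32*32*64 = 65,536
    layers.Dense(216, activation='relu'),                      # 65,536*216 + 216 = 14,155,992
    layers.Dropout(0.5),
    layers.Dense(36),                                          # 216*36 + 36 = 7,812
])
model.summary()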
And here's the error that occurs when training begins (I've cropped out the recurring "failed to allocate memory" log lines):
2021-12-26 10:54:08.890289: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-12-26 10:54:08.891029: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-12-26 10:54:08.899859: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-12-26 10:54:08.933109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-12-26 10:54:08.947342: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-12-26 10:54:08.948462: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-12-26 10:54:08.956260: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-12-26 10:54:08.958977: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
Epoch 1/50
2021-12-26 10:54:11.849166: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2021-12-26 10:54:13.674500: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 144.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-12-26 10:54:13.674920: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats:
Limit: 519487488
InUse: 515392000
MaxInUse: 515465728
NumAllocs: 23263
MaxAllocSize: 134217728
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2021-12-26 10:54:29.147248: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ***********xxxxxxxxxx**********************************************************************xxxxxxxxx
2021-12-26 10:54:29.151731: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at pooling_ops_common.cc:225 : Resource exhausted: OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "train.py", line 196, in <module>
callbacks=[tf.keras.callbacks.ModelCheckpoint("models/net7_e50", monitor="val_loss", verbose=1, save_freq="epoch"), tf.keras.callbacks.TensorBoard("./logs/net7e50")],
File "<Project_Directory>\venv\lib\site-packages\keras\engine\training.py", line 1184, in fit
tmp_logs = self.train_function(iterator)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
result = self._call(*args, **kwds)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 3040, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1964, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
ctx=ctx)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node sequential/max_pooling2d_1/MaxPool (defined at train.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_1432]
Function call stack:
train_function
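For reference, line 196 of train.py (where the traceback points) is the fit() call, roughly as below. The dataset variable names are placeholders; epochs=50 matches the "Epoch 1/50" line, and the batch dimension of 64 in the OOM'd tensor shape suggests my batch size is 64:

model.fit(
    train_ds,                      # training dataset (placeholder name)
    validation_data=val_ds,        # validation set, monitored by the checkpoint callback
    epochs=50,
    callbacks=[
        tf.keras.callbacks.ModelCheckpoint("models/net7_e50", monitor="val_loss",
                                           verbose=1, save_freq="epoch"),
        tf.keras.callbacks.TensorBoard("./logs/net7e50"),
    ],
)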
Any help would be much appreciated!