
I'm trying to accelerate a model I've built with Keras, and after some difficulty with CUDA library versions I've managed to get TensorFlow to detect my GPU. However, now when I run the model with the GPU detected, it fails with the following traceback:

2021-01-20 17:40:26.549946: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ****___*********____________________________________________________________________________________
Traceback (most recent call last):
  File "model.py", line 72, in <module>
    history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=2, validation_data=(x_val, y_val))
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/home/muke/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;ccc21c10a2feabe0;/job:localhost/replica:0/task:0/device:GPU:0;edge_17_IteratorGetNext;0:0
     [[{{node IteratorGetNext/_2}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_875]

Function call stack:
train_function

The model runs fine on just CPU.

I'm unsure whether this is related to the versioning, but to be safe I'll detail the situation. I'm running Gentoo, but because the tensorflow package is so heavy to compile I installed a binary package through pip, version 2.4.0. I installed the latest nvidia-cuda-toolkit package as well as cudnn through my distro's package manager, but when I then test whether TensorFlow detects my GPU, it says it can't find libcusolver.so.10, whereas I have libcusolver.so.11 from the latest toolkit. I tried downgrading to a toolkit version that shipped libcusolver.so.10, but then TensorFlow complained about several other version 11 libraries it couldn't find. So I've installed the latest cuda toolkit package and also placed the older libcusolver.so.10 files in the /opt/cuda/lib64 directory. I understand this is a hacky solution, but I'm not sure what else I can do if that's what it's looking for.
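To double-check which libcusolver the dynamic loader can actually resolve after this hack, here's a quick stdlib-only probe (a diagnostic sketch, not part of the model; ctypes.util.find_library just consults the loader's search path, so the result will vary by machine):

```python
import ctypes.util

# Returns the soname the loader resolves (e.g. "libcusolver.so.11"),
# or None when no libcusolver is visible on the search path.
resolved = ctypes.util.find_library("cusolver")
print("libcusolver resolves to:", resolved)
```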

Here's my full model code using keras:

# Imports for the snippet below (not shown in the original post):
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(8, (7,7), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(16, (7,7), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))

model.add(Dense(num_classes, activation='softmax'))

model.summary()

batch_size = 1000
epochs = 100

model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])

history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=2, validation_data=(x_val, y_val))
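For scale, note that the failing node in the traceback is IteratorGetNext, i.e. the allocator could not even hand over one input batch. A rough back-of-the-envelope for the memory one batch of float32 inputs needs (the actual input_shape isn't shown above, so the 256x256x3 shape here is purely an assumption for illustration):

```python
from math import prod

def batch_bytes(batch_size, shape, itemsize=4):
    """Rough memory footprint of one batch of float32 tensors, in bytes."""
    return batch_size * prod(shape) * itemsize

# e.g. a batch of 1000 hypothetical 256x256 RGB images:
print(batch_bytes(1000, (256, 256, 3)))  # 786432000 bytes, ~0.79 GB
```

With batch_size = 1000, even modest image dimensions put a single batch in the hundreds of megabytes before any activations are allocated, which may matter here.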
  • doesn't "tensorflow.python.framework.errors_impl.ResourceExhaustedError: SameWorkerRecvDone unable to allocate output tensor" simply mean that you ran out of memory, possibly on the GPU? Also, simply copying over a few libraries seems like a surefire way to break things. seems like tf2.4 and cuda 11 should work fine: https://www.tensorflow.org/install/source#tested_build_configurations – geebert Jan 20 '21 at 19:41
    Does this answer your question? [Understanding the ResourceExhaustedError: OOM when allocating tensor with shape](https://stackoverflow.com/questions/46066850/understanding-the-resourceexhaustederror-oom-when-allocating-tensor-with-shape) – geebert Jan 20 '21 at 19:44
  • What is the input image dataset size you are using for this model definition? –  Apr 13 '22 at 19:21

0 Answers