
I am getting the above error (Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR) when I execute the code below. I have checked that my GPU is working using tf.test.is_gpu_available.

# coding: utf-8

import tensorflow as tf
import numpy as np
import keras
from models import *
import os 
import gc 

TF_FORCE_GPU_ALLOW_GROWTH = True
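# NOTE: assigning a plain Python variable named TF_FORCE_GPU_ALLOW_GROWTH has no
# effect; TensorFlow reads it as an environment variable, so it would have to be
# set via os.environ (or in the shell) before the GPU is initialized.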

np.random.seed(1000)
#Paths
MODEL_CONF = "../models/conf/"
MODEL_WEIGHTS = "../models/weights/"
#Model informations
N_CLASSES = 3


def load_array(name):
    return np.load(name, allow_pickle = True)


gc.collect()

dirData = "saved_data/"
trainDir = dirData + "train/"

model = AdaptedLeNet((168, 168, 8), N_CLASSES)
model.summary(print_fn=lambda x: print(x + '\n'))

# Compile the model with the specified loss function.
model.compile(optimizer=keras.optimizers.Adam(),
            loss='categorical_crossentropy',
            metrics=['accuracy'])

for filename in os.listdir(trainDir):
    data = load_array(trainDir + filename)

    train = data["a"]
    labels = data["b"].astype(int).reshape(-1) 
    one_hot_targets = np.eye(N_CLASSES)[labels]

    model.fit(x=train, y=one_hot_targets, batch_size=32, epochs=5)

    gc.collect()

The output of this code is:

Epoch 1/5
2020-04-03 18:50:43.397010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-03 18:50:43.608330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-03 18:50:44.274270: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-03 18:50:44.275686: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-03 18:50:44.275747: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
  File "cnnAlert.py", line 62, in <module>
    model.fit(x=train, y=one_hot_targets, batch_size=32, epochs=5)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node conv2d_1/convolution (defined at /home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_2350]

Function call stack:
keras_scratch_graph

Some more information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1660    Off  | 00000000:01:00.0  On |                  N/A |
| 27%   41C    P8     9W / 120W |    211MiB /  5911MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       989      G   /usr/lib/xorg/Xorg                            78MiB |
|    0      1438      G   cinnamon                                      31MiB |
|    0      8622      G   ...uest-channel-token=16736224539216711033    99MiB |
+-----------------------------------------------------------------------------+

How do I solve this error? Can you help me?

EDIT 1

  • cuDNN version (from cudnn.h): 7605 (7.6.5)
  • Host compiler: GCC 7.5.0
  • TensorFlow: 2.1.0-rc0
  • The cuDNN library is on my LD_LIBRARY_PATH
  • What is your TF version and cuDNN library version? Is the cuDNN lib in your LD_LIBRARY_PATH? Generally, this error points to mismatched CUDA and cuDNN versions. – dgumo Apr 03 '20 at 22:40
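
As an aside, a quick way to gather the versions the comment asks about (a minimal sketch; tf.version.VERSION is the version string TensorFlow reports about itself, and the cudnn.h path shown in the comments may differ per install):

import tensorflow as tf

# TensorFlow version as seen by the interpreter
print("TensorFlow:", tf.version.VERSION)

# CUDA / cuDNN versions can be checked from a shell, for example:
#   nvcc --version
#   grep -A 2 CUDNN_MAJOR /usr/include/cudnn.h   # header location may differ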

2 Answers


You might need to set the TensorFlow session option config.gpu_options.allow_growth to True, which can be done by adding the following at the top of your code:

# In TF 2.x the session/config classes live under tf.compat.v1; enabling memory
# growth stops TensorFlow from pre-allocating the whole GPU at startup.
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
keras.backend.tensorflow_backend.set_session(sess)
– luca_moller

There is an answer on a question about TF 1.0 that explains how to do this in TF2. The suggestion from that answer worked for me, so I'll copy it here. TF2 seems to be moving away from tf.Session, so I tend to prefer this suggestion over the other answer here.

# Enable memory growth on the first visible GPU; this must run before any GPU
# work happens (set_memory_growth returns None, so there is nothing to assign).
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)
– craq
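
A related note on the TF_FORCE_GPU_ALLOW_GROWTH variable the question assigns: it is an environment variable, so it only takes effect if it is set in the process environment before TensorFlow initializes the GPU. A minimal sketch:

import os

# Must be in the environment before TensorFlow initializes the GPU
# (safest: set it before importing tensorflow at all).
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf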