
Summary of my problem

When I run code with tensorflow-gpu, I get the error shown in the title. The error occurs in every script that contains a convolution layer.

Environment

  • Ubuntu 18.04
  • Python 3.7.1
  • tensorflow-gpu 1.13.1
  • CUDA 10.1
  • CuDNN 7.4.2

Details about the GPU

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   46C    P8    21W / 215W |    568MiB /  7949MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1733      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1771      G   /usr/bin/gnome-shell                          57MiB |
|    0      2698      G   /usr/lib/xorg/Xorg                           175MiB |
|    0      2813      G   /usr/bin/gnome-shell                         168MiB |
|    0      3339      G   ...uest-channel-token=11703333986562712743    76MiB |
|    0      8579      G   /proc/self/exe                                67MiB |
+-----------------------------------------------------------------------------+

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/bin
CUDA_PATH=/usr/local/cuda-10.0
LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-10.0/lib64
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH    "
export PATH="/usr/local/cuda/bin:$PATH"
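For reference, a quick check like the following (using the tensorflow-gpu 1.13 API) can confirm whether TensorFlow sees the GPU at all:

import tensorflow as tf

print(tf.__version__)                # 1.13.1
print(tf.test.is_built_with_cuda())  # True for the GPU build
print(tf.test.is_gpu_available())    # creates a session; True if the GPU is usable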

The full error message

2019-06-29 23:13:22.132275: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-29 23:13:22.803064: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-29 23:13:22.805965: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train.py", line 90, in <module>
    main(args)
  File "train.py", line 81, in main
    callbacks=[callback]
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/yudai/.local/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node block1_conv1/Conv2D}}]]
     [[{{node loss/arc_face_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]

The log says "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR", so I suspect the problem is caused by cuDNN. I have tried a few things, such as sudo rm -rf ~/.nv/ from this question and config.gpu_options.allow_growth = True from this GitHub issue, but I could not resolve it.
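For completeness, this is roughly how I applied the allow_growth option with Keras (a minimal sketch; my actual training script differs):

import tensorflow as tf
from tensorflow.keras import backend as K

# Let GPU memory grow on demand instead of being fully pre-allocated
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))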

Please tell me how to solve this problem.

  • Do you get any errors, if you try to verify your CUDA installation as instructed in the [this](https://xcat-docs.readthedocs.io/en/stable/advanced/gpu/nvidia/verify_cuda_install.html) link? – georg-un Jun 29 '19 at 15:08
  • Also, I find [a](https://devtalk.nvidia.com/default/topic/1047898/cuda-10-1-tensorflow-1-13) [lot](https://github.com/tensorflow/tensorflow/issues/26289) [of](https://stackoverflow.com/questions/54969020/advice-on-tensorflow-1-13-on-cuda-10-1) issues regarding tensorflow-gpu 1.13.1 together with CUDA 10.1. Maybe try an upgrade of your tensorflow-gpu package. – georg-un Jun 29 '19 at 15:19
  • Thank you for your comment. I tried it: I got no errors when running 'make' on the samples, and I also tried tensorflow-gpu 1.14 and 2.0b2, but I got the same error. – Y. P Jun 29 '19 at 15:28

1 Answer


Try this code:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

It worked for me.
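For context, memory growth has to be enabled before anything runs on the GPU, e.g. before the model is built. A minimal, self-contained sketch (assuming TF 1.14+/2.x, where tf.config.experimental is available; the tiny model below is only illustrative):

import numpy as np
import tensorflow as tf

# Enable memory growth before any op touches the GPU
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

# A tiny model with a convolution layer, just to exercise cuDNN
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.rand(16, 32, 32, 3).astype('float32')
y = np.random.randint(0, 10, size=(16,))
model.fit(x, y, epochs=1)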
