My computer specifications are: Windows 10, CUDA 11.2, cuDNN 8.0.5, NVIDIA GeForce RTX 3080.

I followed this guide (https://github.com/armaanpriyadarshan/Training-a-Custom-TensorFlow-2.x-Object-Detector) to set up Faster R-CNN. When I trained the network, I got the following error:

2021-01-24 18:12:47.713443: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-01-24 18:12:47.715010: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2021-01-24 18:12:47.718097: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2021-01-24 18:12:47.719553: E tensorflow/stream_executor/cuda/cuda_dnn.cc:340] Error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\absl\app.py", line 300, in run
    _run_main(main, args)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 104, in main
    model_lib_v2.train_loop(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\object_detection\model_lib_v2.py", line 561, in train_loop
    load_fine_tune_checkpoint(detection_model,
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\object_detection\model_lib_v2.py", line 361, in load_fine_tune_checkpoint
    strategy.run(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 1259, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2730, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\mirrored_strategy.py", line 628, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\distribute\mirrored_run.py", line 75, in call_for_each_replica
    return wrapped(args, kwargs)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "C:\Anaconda\envs\tensorflow\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node model/conv1_conv/Conv2D (defined at \site-packages\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1346) ]]
         [[Loss/RPNLoss/BalancedPositiveNegativeSampler/Cast_8/_192]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node model/conv1_conv/Conv2D (defined at \site-packages\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1346) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__dummy_computation_fn_16411]

Errors may have originated from an input operation.
Input Source operations connected to node model/conv1_conv/Conv2D:
 model/lambda/Pad (defined at \site-packages\object_detection\models\keras_models\resnet_v1.py:49)

Input Source operations connected to node model/conv1_conv/Conv2D:
 model/lambda/Pad (defined at \site-packages\object_detection\models\keras_models\resnet_v1.py:49)

Function call stack:
_dummy_computation_fn -> _dummy_computation_fn

How can I solve this problem?

    There is no Tensorflow version which supports CUDA 11.2. The release notes for the version you are using clearly state exactly what versions are supported, if you take the time to read them – talonmies Jan 24 '21 at 10:58

1 Answer

Could you please share your TensorFlow version? I believe that TensorFlow 2.3 and earlier do not support CUDA versions higher than 10.1, so that might be causing the problem.

If you do have matching versions of CUDA and TensorFlow, then I suggest you check out this: it suggests allowing memory growth on your GPU.
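A minimal sketch of what enabling memory growth can look like, assuming TensorFlow 2.x. The helper name `enable_gpu_memory_growth` is made up for illustration; the calls inside it are the standard `tf.config` ones. It would need to run at the very top of the training script, before TensorFlow touches the GPU, and it will not help if the CUDA and TensorFlow versions do not match.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once.

    Returns the names of the GPUs that were configured (an empty list if
    TensorFlow is not installed or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []

    configured = []
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured.append(gpu.name)
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print(f"Could not set memory growth for {gpu.name}: {err}")
    return configured


print("Memory growth enabled on:", enable_gpu_memory_growth())
```

Setting the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true` before launching the script should have a similar effect without editing any code.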

EDIT:

So it appears that you do have TensorFlow 2.4. What I recommend here is downgrading CUDA to 10.1 and TensorFlow to 2.3, as suggested by the author of the repository. Or, if you insist on using TensorFlow 2.4, you should still downgrade CUDA to 11.0 as mentioned here, since TensorFlow does not yet support CUDA 11.2.
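The pairing described above can be written down as a small lookup, which makes it easy to check an install before starting a long training run. This is only a sketch: `TESTED_CUDA` and `cuda_matches` are hypothetical names, and the table covers only the two TensorFlow releases discussed in this thread.

```python
# CUDA version each TensorFlow release discussed here was built against.
# Only these two releases are listed; check the official table for others.
TESTED_CUDA = {
    "2.3": "10.1",
    "2.4": "11.0",
}


def major_minor(version):
    """Reduce a version such as '2.4.1' to its 'major.minor' part ('2.4')."""
    return ".".join(version.split(".")[:2])


def cuda_matches(tf_version, cuda_version):
    """Return True if the installed CUDA matches the one listed for this TensorFlow."""
    expected = TESTED_CUDA.get(major_minor(tf_version))
    if expected is None:
        raise KeyError(f"TensorFlow {tf_version} is not in the table")
    return major_minor(cuda_version) == expected


# The setup from the question: tensorflow-gpu 2.4.1 with CUDA 11.2.
print(cuda_matches("2.4.1", "11.2"))
# The two combinations recommended in this answer.
print(cuda_matches("2.4.1", "11.0"))
print(cuda_matches("2.3.0", "10.1"))
```

The first call reports a mismatch, and the other two report a match.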

KiLJ4EdeN
  • I installed tensorflow-gpu=2.4.1. How can I solve this? – user1050684 Jan 25 '21 at 03:14
  • I looked up the repo you are trying to use. I see that the author also stated that the code was tested with TensorFlow 2.3, which means that they also used CUDA 10.1. The best option would be to downgrade the CUDA and TensorFlow libraries. – KiLJ4EdeN Jan 25 '21 at 07:08
  • Although I trained Faster R-CNN with CUDA 10.1, cuDNN 8.0.4 and tensorflow-gpu 2.3.0, the loss is NaN. How can I solve this problem? – user1050684 Jan 29 '21 at 09:00