2

I have a system with Ubuntu20.04 installed, hence getting correct combinations of CUDA and cudnn for Tensorflow seems a little tricky. I tried CUDA11 but I couldn't get cudnn working, so I installed CUDA10.1 via sudo apt install nvidia-cuda-toolkit and the corresponding cudnn (7.6.5) (some helpful answers). Now when I installed Tensorflow-gpu 2, I could easily check if it was using GPUs as:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU'))) 

which gave correct output 2. However I need to use Tensorflow-gpu-1.15. And with this, I tried the following according to answer in this SO post:

import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print (sess.run(c))

which gave the following output:

2020-07-11 14:05:53.181428: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-11 14:05:53.183404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
2020-07-11 14:05:53.183598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-11 14:05:53.185222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2020-07-11 14:05:53.185548: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.185790: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186015: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186237: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186459: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186578: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-11 14:05:53.186601: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-11 14:05:53.187652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-11 14:05:53.187669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      
Traceback (most recent call last):
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
self._extend_graph()
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: {{node MatMul}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
 [[MatMul]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: node MatMul (defined at /home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748)  was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
 [[MatMul]]

I can't understand the problem, do I need a different CUDA Version (10.0 may be because of the missing libraries), or should I change the device name in tf.device, I'm not sure if CUDA10.0 can be installed on ubuntu20, so may be install older ubuntu version instead?

talonmies
  • 70,661
  • 34
  • 192
  • 269
momo
  • 1,052
  • 5
  • 16
  • 34

2 Answers2

2

I had similar issues getting CUDA to work. My solution was to downgrade to Ubuntu 18.04 and ensure I had the correct combinations of gcc, CUDA and tensorflow as listed in the Tested build configurations:

The original writeup of my solution was documented in this StackOverflow question:

James McGuigan
  • 7,542
  • 4
  • 26
  • 29
0

You have a problem with Cuda libraries such as libcublas, libcurand, etc. Try to set the LD_LIBRARY_PATH correctly and check what directories /usr/lib/cuda/include:/usr/lib/cuda/lib64 contains.

stlaza
  • 79
  • 5