TensorFlow runtime error with cuBLAS

Question

After successfully installing TensorFlow on a cluster, I immediately ran the MNIST demo to check that everything works, but I ran into a problem. I don't know what it is about, but the error appears to come from CUDA:

python3 -m tensorflow.models.image.mnist.convolutional
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:03:00.0
Total memory: 5.00GiB
Free memory: 4.92GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:03:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 715, in _do_call
return fn(*args)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn
status, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
 [[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 294, in main
feed_dict=feed_dict)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
 [[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'MatMul', defined at:
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 221, in main
logits = model(train_data_node, True)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 213, in model
hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1209, in matmul
name=name)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1178, in _mat_mul
transpose_b=transpose_b, name=name)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()

Segmentation fault (core dumped)

In order to build or run TensorFlow with GPU support, both NVIDIA's CUDA Toolkit (>= 7.0) and cuDNN (>= v2) need to be installed. TensorFlow GPU support requires a GPU card with NVIDIA Compute Capability >= 3.0. Have you followed the official setup? https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html – userfi, Jul 11 '16 at 11:39
Absolutely yes, my CUDA version is 7.5 and my cuDNN version is v4. – Pengqi Lu, Jul 11 '16 at 11:55
OK, and does your graphics card have a compute capability greater than or equal to 3.0? – userfi, Jul 11 '16 at 12:01
My graphics card is an Nvidia Tesla K20m. I just looked it up on the Nvidia website and found that its CUDA feature level is 3.5 (is that the compute capability?). – Pengqi Lu, Jul 11 '16 at 17:39
Does access to the cuBLAS library require sudo privileges? I remember that I installed it with pip3, without the sudo prefix. – Pengqi Lu, Jul 11 '16 at 17:42
Yes, that is the compute capability, so the graphics card should work. Try running with sudo. What is your OS? – userfi, Jul 11 '16 at 18:26
@clemej Did you ever find a solution? *I'm* hitting this now – Stumbler, May 02 '18 at 18:36

score 24 · Answer 1 · edited Apr 18 '19 at 04:37

This was a nightmare to find a fix for, but the fix itself is fairly simple:

https://www.tensorflow.org/guide/using_gpu

# add to the top of your code under import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config....)

edited Apr 18 '19 at 04:37 by ElSheikh · answered Sep 01 '18 at 21:30 by Linda MacPhee-Cobb

Worked for me on Keras/tf <3 – captain Mar 21 '19 at 18:49
Worked on a pip install of tensorflow-gpu, but configuring the `gpu_options` was not necessary, only passing an initialized ConfigProto to the session. – Free Url Mar 25 '19 at 01:08
For TensorFlow 2, use `tf.compat.v1.ConfigProto` and `tf.compat.v1.Session` instead of the ones mentioned in the answer. – Gautam Sreekumar Feb 14 '20 at 19:53
What is intended to go into the ellipsis in your answer? As far as I'm aware, `tf.Session(config=config....)` is not valid Python. – AmphotericLewisAcid Feb 18 '20 at 01:33
Can someone try to solve: https://stackoverflow.com/questions/60766376/keras-and-tensorflow-gradients-for-a-complex-custom-loss-function – SheppLogan Mar 20 '20 at 22:23

score 14 · Answer 2 · answered Nov 29 '20 at 18:40

This problem resurfaced for me using the latest stack (TensorFlow 2.5, CUDA 11.1, NVIDIA 3080). The fix above (as amended for TensorFlow 2) worked like a charm:

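import tensorflow as tf
# Let TensorFlow grow its GPU memory use on demand instead of reserving it all up front.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)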
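
To check whether the fix actually helps on a given machine, a small smoke test can run a single float32 matrix multiply with the same shapes as in the error message from the question (64 x 3136 times 3136 x 512). This is only a sketch: the function name `run_sgemm_check` is made up for illustration, it assumes a TensorFlow version that provides `tf.compat.v1`, and it returns `None` when TensorFlow is not installed. Random inputs are used so that the multiply is executed at run time rather than folded away as a constant.

```python
def run_sgemm_check():
    """Run one float32 matmul with the shapes from the error message.

    Returns the result shape, or None when TensorFlow is not installed.
    (Hypothetical helper, not part of TensorFlow.)
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    v1 = tf.compat.v1
    config = v1.ConfigProto()
    config.gpu_options.allow_growth = True  # same setting as in the answer

    # Build the graph explicitly so this also works under TensorFlow 2.
    graph = tf.Graph()
    with graph.as_default():
        a = tf.random.uniform((64, 3136))   # float32 by default
        b = tf.random.uniform((3136, 512))
        product = tf.matmul(a, b)

    # The with block closes the session when it exits.
    with v1.Session(graph=graph, config=config) as sess:
        result = sess.run(product)
    return result.shape


if __name__ == "__main__":
    print(run_sgemm_check())
```

If the matmul is placed on the GPU and cuBLAS is still broken, this should fail with the same "Blas SGEMM launch failed" error instead of returning a shape.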

score 5 · Answer 3 · edited Oct 11 '21 at 04:18

The following two lines work for me. I copied them from GitHub, but I have no idea what they mean.

import tensorflow as tf
# Ask TensorFlow to allocate memory on the first GPU on demand.
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Another way to do the same thing more simply is:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # set this before TensorFlow first uses the GPU

My environment: TF version: 2.6.0, CUDA version: 11.2 and GPU driver version: 460.32.03. I don't know what the version of cuDNN is because I can't find it.
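
As for what those lines mean: by default TensorFlow reserves nearly all of the memory on every visible GPU as soon as it initializes them. With memory growth enabled it starts small and allocates more as needed. A plausible explanation for why this fixes the error (not confirmed anywhere in this thread) is that cuBLAS then has memory left for its own initialization. The sketch below combines both variants and applies the setting to every GPU rather than only `physical_devices[0]`. The helper name `enable_memory_growth` is hypothetical, and the function returns an empty list when TensorFlow is missing or no GPU is visible.

```python
import os

# Environment-variable variant: set it before TensorFlow touches the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Enable on-demand GPU memory allocation for every visible GPU.

    Returns the names of the GPUs that were configured.
    (Hypothetical helper, not part of TensorFlow.)
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []

    configured = []
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured.append(gpu.name)
        except RuntimeError:
            # Raised when the GPUs were already initialized; the setting
            # has to be applied before any op runs on the device.
            pass
    return configured


if __name__ == "__main__":
    print(enable_memory_growth())
```

Either variant alone should be enough; the important part is that it runs at the very start of the program.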

edited Oct 11 '21 at 04:18 · answered Oct 11 '21 at 04:04 by zzzhhh

This worked for me for python 3.8.6 on Ubuntu 20.04. – james-see Dec 06 '21 at 01:46
This solved it for me. Using a tensorflow nightly container running on kubernetes. – nklsla Apr 24 '23 at 22:47

score 4 · Answer 4 · answered Dec 12 '16 at 16:09

I had exactly the same error because CUDA 5.5 came before CUDA 7.5 in my LD_LIBRARY_PATH. After I moved 7.5 in front of 5.5, everything worked fine.
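
The dynamic loader searches the LD_LIBRARY_PATH entries from left to right, so the first directory that contains a matching library wins. A quick way to see which CUDA directories are listed, and in which order, is sketched below. The function name `cuda_search_order` and the example paths are made up for illustration; the "contains cuda" match is a rough heuristic, not a real version check.

```python
import os


def cuda_search_order(ld_library_path=None):
    """Return the LD_LIBRARY_PATH entries that mention CUDA, in search order.

    (Hypothetical helper; matching on the substring "cuda" is a heuristic.)
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    # LD_LIBRARY_PATH is colon-separated; skip empty entries.
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if "cuda" in p.lower()]


if __name__ == "__main__":
    # Example with hypothetical install locations: 5.5 shadows 7.5 here.
    example = "/usr/local/cuda-5.5/lib64:/usr/lib:/usr/local/cuda-7.5/lib64"
    for position, path in enumerate(cuda_search_order(example)):
        print(position, path)
    # And the value from the current environment:
    print(cuda_search_order())
```

If an older toolkit shows up ahead of the one TensorFlow was built for, reordering the variable as described above is the thing to try.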

answered Dec 12 '16 at 16:09 by penglz

score 4 · Answer 5 · answered Aug 27 '20 at 12:04

Aside from the solutions already mentioned, this error is also thrown when the cuBLAS version isn't compatible with the CUDA version. In my case, libcublas10 version 10.2.2.89-1 was incompatible with CUDA 10.1, so I had to downgrade:

sudo apt-get install libcublas10=10.2.1.243-1 libcublas-dev=10.2.1.243-1 cuda-libraries-10-1 cuda-libraries-dev-10-1

answered Aug 27 '20 at 12:04 by runDOSrun

Similarly, I got the same error because my cudnn version didn't match the CUDA version. I had libcudnn8=8.1.1.33-1 installed which didn't match cuda-11-0. – xel Mar 05 '21 at 12:34

score 0 · Answer 6 · answered Aug 14 '19 at 16:26

Make sure to call sess.close() between sessions to free the resources; otherwise you'll have to kill the process in the task manager.

answered Aug 14 '19 at 16:26 by Tabarnacos

score 0 · Answer 7 · answered Apr 03 '20 at 15:28

This can be a compatibility issue between the CUDA version and the TensorFlow version. In my case, the CUDA version was 10.0 and the TensorFlow version was 2.1.0, and this error occurred. After switching from TensorFlow 2.1.0 to TensorFlow 2.0.0, the error disappeared.
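
This matches TensorFlow's published tested build configurations, which list 2.1.0 with CUDA 10.1 and 2.0.0 with CUDA 10.0. On newer releases the installed package can report which CUDA and cuDNN versions it was built against through `tf.sysconfig.get_build_info()`. That call does not exist in old versions such as 2.1.0, so the sketch below falls back to `None` there and also when TensorFlow is not installed. The helper names `summarize_build` and `installed_tf_build` are hypothetical, and the reported versions are the ones TensorFlow was built against, not necessarily the ones installed on the machine.

```python
def summarize_build(build_info):
    """Pick the CUDA-related fields out of a build-info mapping.

    (Hypothetical helper; missing keys are reported as None.)
    """
    return {
        "is_cuda_build": build_info.get("is_cuda_build"),
        "cuda_version": build_info.get("cuda_version"),
        "cudnn_version": build_info.get("cudnn_version"),
    }


def installed_tf_build():
    """Return the CUDA build summary of the installed TensorFlow, or None."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    get_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_info is None:
        # Older TensorFlow releases do not provide get_build_info().
        return None
    return summarize_build(get_info())


if __name__ == "__main__":
    print(installed_tf_build())
```

Comparing the reported CUDA version with the toolkit that is actually installed (for example via `nvcc --version`) shows whether the two line up.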

answered Apr 03 '20 at 15:28 by Lingfeng Zhang
