17

when I successfully install tensorflow on cluster, I immediately running mnist demo to check if it's going well, but here I came up with a problem. I don't know what is this all about, but it looks like the error is coming from CUDA

python3 -m tensorflow.models.image.mnist.convolutional
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K20m
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:03:00.0
Total memory: 5.00GiB
Free memory: 4.92GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:03:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 715, in _do_call
return fn(*args)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 697, in _run_fn
status, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors.py", line 450, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
 [[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 294, in main
feed_dict=feed_dict)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InternalError: Blas SGEMM launch failed : a.shape=(64, 3136), b.shape=(3136, 512), m=64, n=512, k=3136
 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, Variable_4/read)]]
 [[Node: add_5/_35 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_299_add_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'MatMul', defined at:
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
  File "/home/gpuusr/local/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 316, in <module>
tf.app.run()
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 221, in main
logits = model(train_data_node, True)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/models/image/mnist/convolutional.py", line 213, in model
hidden = tf.nn.relu(tf.matmul(reshape, fc1_weights) + fc1_biases)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1209, in matmul
name=name)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1178, in _mat_mul
transpose_b=transpose_b, name=name)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
  File "/home/gpuusr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()

Segmentation fault (core dumped)
talonmies
  • 70,661
  • 34
  • 192
  • 269
Pengqi Lu
  • 231
  • 1
  • 3
  • 5
  • 1
    In order to build or run TensorFlow with GPU support, both NVIDIA's Cuda Toolkit (>= 7.0) and cuDNN (>= v2) need to be installed. TensorFlow GPU support requires having a GPU card with NVidia Compute Capability >= 3.0. have you follow the officcial setup? https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html – userfi Jul 11 '16 at 11:39
  • absolutely yes, my cuda version is 7.5 and cudnn version is v4 – Pengqi Lu Jul 11 '16 at 11:55
  • ok, and your graphics-card has capability greater or equal to 3.0? – userfi Jul 11 '16 at 12:01
  • My graphic cards is Nvidia Tesla K20m. I just looked up and found its cuda feature is 3.5(is it the compute capability?) from Nvidia website – Pengqi Lu Jul 11 '16 at 17:39
  • Does the access to cublas library required sudo authority? I remembered that I used pip3 install it without sudo prefix command – Pengqi Lu Jul 11 '16 at 17:42
  • Yes is the capabiloty, then the graphics cards should work. Try using sudo authority. What's your OS system? – userfi Jul 11 '16 at 18:26
  • My OS system is RadHat 6 – Pengqi Lu Jul 12 '16 at 08:01
  • Did you ever find a solution? I'm hitting this now. – clemej Oct 06 '16 at 01:42
  • 2
    @clemej Did you ever find a solution? *I'm* hitting this now – Stumbler May 02 '18 at 18:36

7 Answers7

24

This was a nightmare to find a fix for - but the fix is somewhat simple

https://www.tensorflow.org/guide/using_gpu

# add to the top of your code under import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config....)
ElSheikh
  • 321
  • 6
  • 28
Linda MacPhee-Cobb
  • 7,646
  • 3
  • 20
  • 18
14

This problem re-surfaced for me using the latest stack (tensorflow 2.5, Cuda 11.1, Nvidia 3080). The fix above (as amended for Tensorflow 2) worked like a charm:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)
James Crawford
  • 141
  • 1
  • 2
5

The following two lines work for me. I copied it from github, but I have no idea what they mean.

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Another way to do the same thing more simply is:

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

My environment: TF version: 2.6.0, CUDA version: 11.2 and GPU driver version: 460.32.03. I don't know what the version of cuDNN is because I can't find it.

zzzhhh
  • 291
  • 3
  • 10
4

I had exactly same error because in LD_LIBRARY_PATH I have cuda 5.5 in front of 7.5. After I moved 7.5 in front of 5.5 everything works fine now.

penglz
  • 41
  • 4
4

Aside from the mentioned solutions, this error also gets thrown when the CUBLAS version isn't compatible with the CUDA version. In my case, libclubas10 version 10.2.2.89-1 was incompatible with CUDA 10.1, so I had to downgrade:

sudo apt-get install libcublas10=10.2.1.243-1 libcublas-dev=10.2.1.243-1 cuda-libraries-10-1 cuda-libraries-dev-10-1
runDOSrun
  • 10,359
  • 7
  • 47
  • 57
  • Similarly, I got the same error because my cudnn version didn't match the CUDA version. I had libcudnn8=8.1.1.33-1 installed which didn't match cuda-11-0. – xel Mar 05 '21 at 12:34
0

Make sure to use sess.close() between each session to free the resources otherwise you'll have to kill the process in the task manager

0

The compatibility issue between CUDA version and TensorFlow version. In my case, My CUDA version is 10.0 and TensorFlow version is 2.1.0, and this issue occurs. After changing TensorFlow 2.1.0 to TensorFlow 2.0.0, this issue disappears.