
I am having problems executing a simple TensorFlow model that worked well yesterday. I suspect the whole problem comes down to this error:

      Blas GEMM launch failed

In the console it says,

  tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed

My impression is that this may relate to my CUDA installation, based on this question:

TensorFlow: Blas GEMM launch failed

However, I can't see how to run the simpleCUBLAS examples. I am completely new to CUDA.

I have four 1080 Ti GPUs (Ubuntu 16.04, TensorFlow 1.3.0), and I have not found any zombie processes taking up GPU memory. Any help is greatly appreciated.
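For anyone wanting to repeat the zombie-process check, here is a minimal sketch, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH. The function name `list_gpu_procs` is made up for illustration; the query fields are standard `nvidia-smi` options.

```shell
# List compute processes currently holding GPU memory (CSV, one per line).
# Prints a short message instead if nvidia-smi is missing or fails.
list_gpu_procs() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv \
            || echo "nvidia-smi failed"
    else
        echo "nvidia-smi not found"
    fi
}

list_gpu_procs
```

If only the CSV header line comes back, no compute process is holding GPU memory at that moment.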

talonmies
GhostRider
  • It can mean you ran out of memory. Try reducing batch size or making model smaller – Yaroslav Bulatov Sep 04 '17 at 16:27
  • Yaroslav, many thanks. I don't think the code can be the issue. This model ran many, many times without problems over the past few days. Also, I reduced the batch size to 1 and reduced the image size (it's a CNN). I think there is an issue with memory allocation for sure, but not due specifically to this model. I have also had a "cuDNN cannot create handle" error (again suggesting a memory issue). Been stuck on this for 9 hours.... – GhostRider Sep 04 '17 at 16:48
  • `cuDNN cannot create handle` can also be caused by out of memory on GPU – Yaroslav Bulatov Sep 04 '17 at 16:58
  • Exactly. I agree with you, but it doesn't explain why a simple model with less than 100k parameters that trained efficiently one day, suddenly throws a memory error. I agree that the problem is related to memory - I'm just uncertain that the model is the issue. I thought I had some zombie processes but I don't. Thanks for your responses – GhostRider Sep 04 '17 at 17:36

1 Answer


So I found the answer after days of going mad. I first ran this:

 cd /usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS
 make
 ./simpleCUBLAS

to check my CUBLAS installation. It returned CUBLAS INITIALIZATION FAILED!!!

So next I did this (based on advice):

 sudo rm -rf ~/.nv

And it worked. Hope this saves someone else. Seems easy when you see it.
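A more cautious variant of the same step, as a sketch only: move the cache directory aside instead of deleting it, so it can be restored if removing it turns out not to help. The helper name `backup_nv_cache` is hypothetical, and it assumes `~/.nv` is a directory you can write to (if it is owned by root, the `mv` would need `sudo`).

```shell
# Move a cache directory aside with a timestamp suffix instead of deleting it.
# Defaults to ~/.nv when no argument is given.
backup_nv_cache() {
    cache_dir="${1:-$HOME/.nv}"
    if [ -d "$cache_dir" ]; then
        mv "$cache_dir" "${cache_dir}.bak.$(date +%s)"
        echo "moved $cache_dir aside"
    else
        echo "no cache at $cache_dir"
    fi
}
```

Calling `backup_nv_cache` with no argument and then re-running `./simpleCUBLAS` should show whether the cache was the culprit.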

The other thing worth mentioning is that this problem also occasionally threw this error:

    tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
    tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
    tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 

This was cryptic. Everybody suggested it was a memory issue and, sure enough, my GPUs got hogged by python during the initialization of my TF model. But it was the CUBLAS error that led me to the solution.
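As a side note on the GPUs being hogged at start-up: by default TensorFlow claims memory on every GPU it can see. One way to narrow things down while debugging is to expose a single device through the standard `CUDA_VISIBLE_DEVICES` environment variable. This is a sketch, not part of the fix above, and `train.py` is a placeholder script name.

```shell
# Expose only the first GPU to CUDA programs started from this shell.
export CUDA_VISIBLE_DEVICES=0
echo "visible GPUs: $CUDA_VISIBLE_DEVICES"

# Then start the model as usual, for example:
# python train.py   # hypothetical script name
```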

GhostRider