46

I installed the TensorFlow 1.0.1 GPU version on my MacBook Pro with a GeForce GT 750M, along with CUDA 8.0.71 and cuDNN 5.1. I am running TensorFlow code that works fine with the CPU-only build, but with the GPU version I get this error (once in a while it works too):

name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 67.48MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 67.48M (70754304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Training...

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Abort trap: 6

What is happening here? Is this a bug in TensorFlow? Please help.

Here is the GPU memory usage while I run the Python code:

Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 91.477 of 2047.6 MB (i.e. 4.47%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 36.121 of 2047.6 MB (i.e. 1.76%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 71.477 of 2047.6 MB (i.e. 3.49%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
Shimano
  • 785
  • 1
  • 6
  • 13
  • please post your nvidia GPU util & memory figures. I'm guessing you're out of GPU memory. – ruoho ruotsi Apr 01 '17 at 03:55
  • How do I check this please? Thanks – Shimano Apr 02 '17 at 03:41
  • On Linux I use 'nvidia-smi', but on macOS it doesn't exist. Try this: https://github.com/phvu/cuda-smi – ruoho ruotsi Apr 02 '17 at 04:13
  • 1
    It initially seemed like lack of space but I tried again after restart and there was space. Here is the terminal output. (https://pastebin.com/9D2983ex) – Shimano Apr 02 '17 at 10:13
  • Okay, if this is your issue, hopefully the tensorflow guys can provide some insight: https://github.com/tensorflow/tensorflow/issues/8879 – ruoho ruotsi Apr 02 '17 at 20:53
  • Thanks for your help. I posted it as a tensorflow issue – Shimano Apr 03 '17 at 07:18
  • I have the exact same setup (MBP w/750M GPU). I was able to resolve this error by downgrading the CUDA driver from 8.083 to 8.0.46. I'm running tensorflow-gpu 1.1.0, (tensorflow 1.0.0 is also installed, but GPU version running). My setup also will occasionally fault if I haven't freed memory on the GPU. – anon01 May 21 '17 at 17:55

23 Answers

45

In TensorFlow 2.0, my issue was resolved by enabling memory growth. ConfigProto is deprecated in TF 2.0, so I used tf.config.experimental. My computer specs are:

  • OS: Ubuntu 18.04
  • GPU: GeForce RTX 2070
  • Nvidia Driver: 430.26
  • Tensorflow: 2.0
  • Cudnn: 7.6.2
  • Cuda: 10.0

The code I used was:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)  # enable memory growth on the first GPU
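
If you have more than one GPU, or you hit the "Physical devices cannot be modified after being initialized" RuntimeError mentioned in the comments, a variant like this sketch (assuming TF 2.x) enables growth on every visible GPU; it must run before anything else touches the GPU:

import tensorflow as tf

# Must run before any op initializes the GPU, otherwise TF raises a RuntimeError.
gpus = tf.config.experimental.list_physical_devices('GPU')
try:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
    print(e)  # memory growth can only be set before the GPUs are initialized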
aveevu
  • 767
  • 6
  • 15
  • I'm using Pop!_OS 18.04 with an NVIDIA GeForce RTX 2080, CUDA 10.0, cuDNN 7.6.0 and TensorFlow 2.0. I only had this issue when using Jupyter Lab, so probably the kernel's memory footprint is limited and hence we have to set memory growth. When I train models straight from Python code, it works fine. Anyway, being able to use Jupyter notebooks is a nice thing to have, and this answer helped. – Ekho Nov 26 '19 at 18:25
  • This helped for the exact same configuration. I have GPU: GeForce RTX 2070 Super. Very useful. – feradz Dec 22 '19 at 00:10
  • I have the error message: Physical devices cannot be modified after being initialized – cloudscomputes Aug 31 '21 at 15:09
  • Solved also in my case with a Geforce RTX 3070 on Fedora OS and tensorflow 2.4. – Davide Nov 16 '21 at 08:58
29

I have managed to get it working by deleting the .nv folder in my home folder:

sudo rm -rf ~/.nv/
Félix Fu
  • 447
  • 4
  • 4
  • 2
    Don't know how this is happening, but this solution solved my problem too! – Anoop K. Prabhu Apr 23 '18 at 15:20
  • 3
    The directory is ~/.nv. I believe it is caching some binaries, and when you update the cuDNN header files, the old binaries are still fetched from the cache; that is one cause of this issue. – dgumo Apr 28 '18 at 14:40
  • 5
    This solved my problem, but you need to run without sudo. – seleucia Sep 26 '18 at 17:58
23

In my case, after checking the cuDNN and CUDA versions, I found my GPU was simply out of memory. Watching watch -n 0.1 nvidia-smi in another bash terminal, I could see that the moment the error 2019-07-16 19:54:05.122224: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR appears is the moment GPU memory is nearly full.

So I configured a limit on how much GPU memory TensorFlow may use. As I use the tf.keras module, I added the following code to the beginning of my program:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # let TF use at most 90% of GPU memory
tf.keras.backend.set_session(tf.Session(config=config))

Then, problem solved!

You can also reduce your batch_size or use smarter ways to feed your training data (such as tf.data.Dataset with caching), as sketched below. I hope my answer can help someone else.
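
As an illustration only (the array shapes and batch size below are made up), a cached tf.data pipeline could look roughly like this; feed the resulting dataset to model.fit or your input function:

import numpy as np
import tensorflow as tf

# Toy arrays standing in for real training data.
features = np.random.rand(1000, 28, 28, 1).astype('float32')
labels = np.random.randint(0, 10, size=1000)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .cache()       # keep decoded examples in host memory
           .shuffle(1000)
           .batch(32)     # a smaller batch size also lowers peak GPU memory
           .prefetch(1))  # overlap input preparation with training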

Zhao Kuangshi
  • 231
  • 2
  • 3
12

Adding the following code worked for me:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
sess = tf.Session(config=config)

In my environment there is no mismatch between the cuDNN and CUDA versions. OS: Ubuntu 18.04; TensorFlow: 1.14; cuDNN: 7.6; CUDA: 10.1 (driver 418.87.00).

Neeraj Jain
  • 598
  • 10
  • 10
9

For me, the 4th option solved the problem nicely. https://blog.csdn.net/comway_Li/article/details/102953634?utm_medium=distribute.pc_relevant.none-task-blog-baidujs-2

1.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 1.0
    session = tf.Session(config=config, ...)

2.
    config = tf.ConfigProto() 
    config.gpu_options.allow_growth = True 
    sess = tf.Session(config=config)

3.
    sudo rm -rf ~/.nv

4.
    from tensorflow.compat.v1 import ConfigProto
    from tensorflow.compat.v1 import InteractiveSession
    #from tensorflow import ConfigProto
    #from tensorflow import InteractiveSession
    config = ConfigProto()
    config.gpu_options.allow_growth = True
    session = InteractiveSession(config=config)
Qinyu Chen
  • 91
  • 1
  • 1
7

As strange as this may sound, try restarting your computer and rerunning your model. If the model then runs fine, the issue is with your GPU memory allocation and TensorFlow's management of that available memory. On Windows 10 I had two terminals open, and closing one solved my problem. There could be open (zombie) threads that are still holding memory.

Kenan
  • 13,156
  • 8
  • 43
  • 50
  • The solution in my case. The error occurred even though I didn't make any changes to my files and before that everything worked, that's why I was skeptical about solutions that include adding files or lines to my existing code. – Serjuice Jul 19 '22 at 21:00
6

This works for me:

export TF_FORCE_GPU_ALLOW_GROWTH='true'

goodcow
  • 4,495
  • 6
  • 33
  • 52
  • 1
    Because tensorflow, by default, preallocates memory (I don't know how much) and this causes it to quickly run out of memory, particularly if you are running many jobs at once. This allows the memory to grow – goodcow Dec 10 '20 at 18:24
  • is there a way to set this permanently? I noticed I have to rerun it when i reboot – Kenan Dec 10 '20 at 18:52
  • You could put it in your `.bashrc` or call it through your script with `subprocess` – goodcow Dec 10 '20 at 23:55
  • i wonder if there will be any issue with placing it in the bashrc, I'll do that – Kenan Dec 11 '20 at 00:24
  • I have this same issue, and I was able to resolve it only by setting this ENV variable in my .zshrc, which I think is really weird!! I was able to run ai-benchmark, but with several warnings about GPU RAM being full. The Tensorflow CNN [example](https://www.tensorflow.org/tutorials/images/cnn) could also only run with TF_FORCE_GPU_ALLOW_GROWTH='true'. In that case I noticed that nvidia-smi reported only a low portion of RAM used; otherwise it showed the used RAM as almost full. – EanX Feb 08 '21 at 10:54
  • This saved my day! I was struggling with this with a RTX 2070 – charlie Dec 02 '21 at 21:14
5

For anyone getting this issue in Jupyter notebook:

I was running two jupyter notebooks. After closing one of them the issue was solved.

3

Try this

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
Diego Ferri
  • 2,657
  • 2
  • 27
  • 35
Shekhrozx
  • 340
  • 1
  • 6
  • 15
2

I also got the same error, and I resolved the issue. My system properties were as follows:

  • Operating System: Ubuntu 14.04
  • GPU: GTX 1050Ti
  • Nvidia Driver: 375.66
  • Tensorflow: 1.3.0
  • Cudnn: 6.0.21 (cudnn-8.0-linux-x64-v6.0.deb)
  • Cuda: 8.0.61
  • Keras: 2.0.8

How I solved the issue is as follows:

  1. I copied the cuDNN files to the appropriate locations (/usr/local/cuda/include and /usr/local/cuda/lib64)

  2. I set the environment variables as:

    * export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
    * export CUDA_HOME=/usr/local/cuda
    
  3. I also ran the sudo ldconfig -v command to cache the shared libraries for the runtime linker.

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
2

In my case it seems that the problem was caused by a TensorFlow/cuDNN version mismatch. The following helped me (I was working on Ubuntu 16.04 with an NVIDIA Tesla K80 on Google Cloud; TensorFlow 1.5 finally worked with cuDNN 7.0.4 and CUDA 9.0):

  1. Remove cuDNN completely:

    sudo rm /usr/local/cuda/include/cudnn.h
    sudo rm /usr/local/cuda/lib64/libcudnn*
    

After doing so, import tensorflow should raise an error.

  2. Download the appropriate cuDNN version. Note that there is cuDNN 7.0.4 for CUDA 9.0 and cuDNN 7.0.4 for CUDA 8.0; you should choose the one corresponding to your CUDA version. Be careful at this step or you'll get a similar problem again. Install cuDNN as usual:

    tar -xzvf cudnn-9.0-linux-x64-v7.tgz
    cd cuda
    sudo cp -P include/cudnn.h /usr/include
    sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
    sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*
    

In this example I've installed cuDNN 7.0.x for CUDA 9.0 (x actually doesn't matter). Take care to match your CUDA version.

  3. Restart the computer. In my case the problem vanished. If the error still occurs, consider installing another version of tensorflow.
Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
Daniel Savenkov
  • 343
  • 2
  • 11
1

This is a cuDNN compatibility issue. Check what you installed that uses the GPU, for instance tensorflow-gpu. What version is it? Is that version compatible with your cuDNN version, and is the installed cuDNN the right version for your CUDA?

I have observed, for example:

  • cuDNN v7.0.3 for CUDA 7.*
  • cuDNN v7.1.2 for CUDA 9.0
  • cuDNN v7.3.1 for CUDA 9.1

and so on.

So also check the correct TensorFlow version for your CUDA configuration. For instance, using tensorflow-gpu: TF v1.4 for cuDNN 7.0.*, TF v1.7 and above for CUDA 9.0, etc.

So all you need to do is reinstall the appropriate cuDNN version. Hope it helps!
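
If you are on TensorFlow 2.3 or newer, a quick (hedged) way to see which CUDA and cuDNN versions your TensorFlow build was compiled against, so you can compare them with what is actually installed, is something like:

import tensorflow as tf

print("TF:", tf.__version__)
build = tf.sysconfig.get_build_info()  # available in TF >= 2.3; keys may be absent on CPU-only builds
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))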

Nwoye CID
  • 834
  • 8
  • 8
1

Please remember to close your TensorBoard terminal/cmd and any other terminals that interact with the directory. Then you can restart the training and it should work.

JTIM
  • 2,774
  • 1
  • 34
  • 74
1

It has to do with the fraction of GPU memory available for creating the cuDNN handle, controlled by per_process_gpu_memory_fraction. Reducing this memory fraction yourself will resolve the error.

sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
    allow_soft_placement=True)

with tf.Session(config=sess_config) as sess:
    sess.run([whatever])

Use as small a fraction as fits in your memory. (In the code above I use 0.7; you can start with 0.3 or even smaller, then increase until you hit the same error again, and that's your limit.) Pass it as config to your tf.Session(), tf.train.MonitoredTrainingSession(), or Supervisor's sv.managed_session(), as in the sketch below.

This should allow your GPU to create a cuDNN handle for your TensorFlow code.
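
If you are using tf.train.MonitoredTrainingSession instead of a plain session, a minimal sketch (assuming TF 1.x and starting from a conservative 0.3 fraction) could look like:

import tensorflow as tf

sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.3),  # start small, raise until it errors
    allow_soft_placement=True)

# MonitoredTrainingSession accepts the same config object.
with tf.train.MonitoredTrainingSession(config=sess_config) as sess:
    pass  # sess.run(...) your training ops here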

Nwoye CID
  • 834
  • 8
  • 8
1

I had the same problem (Ubuntu 18.04). I was using:

  • tensorflow 2.1
  • cuda 10.1
  • cudnn 7.6.5

I solved it by uninstalling CUDA (and removing its folder) and installing it via apt, following the TensorFlow page instructions: https://www.tensorflow.org/install/gpu?hl=fr#ubuntu_1804_cuda_101

Josmar
  • 11
  • 1
1

I solved this problem by adjusting the GPU memory usage using the following lines:

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7  # use at most ~70% of GPU memory
tf.compat.v1.keras.backend.set_session(
    tf.compat.v1.Session(config=config))

This works for TensorFlow 2.

singrium
  • 2,746
  • 5
  • 32
  • 45
  • 1
    in tensorFlow2 it should be done like: `physical_devices = tf.config.experimental.list_physical_devices('GPU') tf.config.experimental.set_memory_growth(physical_devices[0], True)` – Hugo Jun 14 '21 at 19:27
1

I had the same problem and solved it by adding:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
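
As far as I can tell, this variable has to be set before TensorFlow initializes the GPU, so the safest place is above the tensorflow import:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # set before TF touches the GPU

import tensorflow as tf  # imported afterwards, so it picks up the setting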
0

I too encountered the same problem:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.493 pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.60GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed:  stream->parent()->GetConvolveAlgorithms(&algorithms)

Aborted (core dumped)

But in my case, running the command with sudo worked perfectly fine.

MBT
  • 21,733
  • 19
  • 84
  • 102
Abhay
  • 163
  • 1
  • 4
0

I encountered this problem when I accidentally installed the CUDA 9.2 libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb instead of libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb on a system with CUDA 9.0 installed.

I got there because I had CUDA 9.2 installed and had downgraded to CUDA 9.0, and evidently libcudnn is specific to the CUDA version.

jrounds
  • 73
  • 1
  • 8
0

For me, re-running the CUDA installation as described here solved the problem:

# Add NVIDIA package repository
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt install ./cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt update

# Install CUDA and tools. Include optional NCCL 2.x
sudo apt install cuda9.0 cuda-cublas-9-0 cuda-cufft-9-0 cuda-curand-9-0 \
    cuda-cusolver-9-0 cuda-cusparse-9-0 libcudnn7=7.2.1.38-1+cuda9.0 \
    libnccl2=2.2.13-1+cuda9.0 cuda-command-line-tools-9-0

During the installation, apt-get downgraded cudnn7, which I think is the culprit here. It probably got updated accidentally by apt-get upgrade to a version that is incompatible with some other piece of the system.

Francesco Pasa
  • 511
  • 6
  • 14
0

I ran into the same problem because my GPU memory was being held by some background zombie/terminated processes; killing those processes worked for me:

ps aux | grep 'Z' # Zombie
ps aux | grep 'T' # Terminated
kill -9 your_zombie_or_terminated_process_id
xtluo
  • 1,961
  • 18
  • 26
0

Rebooting the machine worked for me. Try this:

sudo reboot

Then, re-run the code

xKobalt
  • 1,498
  • 2
  • 13
  • 19
0

In my case, I had 2 GPUs and GPU 0 was busy training another model. I set GPU 1 explicitly: os.environ["CUDA_VISIBLE_DEVICES"]="1"

I made the mistake of executing the above line after the model was created and just before training the model.

I solved this problem by placing the above line at the top, i.e. right after importing the libraries.

The problem was that once the model has decided which GPUs it can use (if you don't specify explicitly, it considers all available GPUs), it will not honor later code that restricts it to a single GPU.
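
A minimal sketch of that ordering (the device index "1" is just this answer's example; adjust it to your setup):

import os
import tensorflow as tf

# Set this before building any model or running any op that touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# ... only now build and train the model; TensorFlow will see GPU 1 only ...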