I've done a lot of work with TF1 and recently upgraded to TF2, but I'm running into an issue where networks don't converge when training on a GPU, even though the same code converges on the CPU. Following the CNN tutorials on https://www.tensorflow.org/tutorials, I have noticed that the models fail to learn during training. Any ideas on what is causing this?
Another post suggested that this may be caused by floating point errors, but I have a hard time believing things are that unstable -- especially across multiple tutorials. I hit this problem on the following tutorials: Convolutional Neural Network (CNN), Transfer learning and fine-tuning, and Transfer learning with TF Hub.
I am running:
- TensorFlow version 2.3.0
- CUDA compilation tools, release 11.2, V11.2.125
- On an NVIDIA GeForce RTX 3090 GPU or an Intel i7-10700K CPU
- I had some trouble installing things initially, but the method described in this answer ended up working -- could that be the root issue?
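In case a version mismatch is the culprit, here is a small sketch of how I could print the CUDA/cuDNN versions my TensorFlow build was compiled against, so they can be compared with the installed toolkit (this assumes tf.sysconfig.get_build_info() is available in my TF version; it is not part of the tutorial code):
import tensorflow as tf
# Print the CUDA/cuDNN versions this TensorFlow build was compiled against.
# .get() is used so missing keys just print None instead of raising.
build_info = tf.sysconfig.get_build_info()
print('built with CUDA:', build_info.get('cuda_version'))
print('built with cuDNN:', build_info.get('cudnn_version'))
print('GPUs visible to TF:', tf.config.list_physical_devices('GPU'))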
To demonstrate, I copy/pasted the code from the CNN tutorial into the following script:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # minimize logs
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
RUN_ON_CPU = False
if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # hide the GPU so the run falls back to the CPU
print('gpu available', tf.config.list_physical_devices('GPU'))
# load dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# build model backbone
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# add dense layers on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()
# compile and train
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))
plt.figure()
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
# plt.ylim([0.5, 1])
plt.legend(loc='lower right')
if RUN_ON_CPU:
    plt.title('Training on CPU')
else:
    plt.title('Training on GPU')
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('test loss and accuracy', test_loss, test_acc)
plt.show()
This plots the following training curves and prints the following results, depending on the RUN_ON_CPU flag:
GPU test loss and accuracy 2.302645444869995 0.10000000149011612
CPU test loss and accuracy 0.879743754863739 0.7060999870300293
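As a quick sanity check on those numbers (my own arithmetic, not from the tutorial): the GPU loss of ~2.3026 is exactly what a 10-class model that guesses uniformly would produce, since -ln(1/10) ≈ 2.3026 and uniform guessing gives 10% accuracy:
import math
# Cross-entropy of a uniform prediction over 10 classes: -ln(1/10)
chance_loss = -math.log(1.0 / 10.0)
print(chance_loss)  # ~2.302585, essentially the GPU test loss above
print(1.0 / 10.0)   # 0.1, essentially the GPU test accuracy above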
The tutorial claims the CNN should reach a test accuracy of roughly 70%, which the GPU run doesn't come close to. To confirm the GPU was actually being used, I logged tf.config.list_physical_devices('GPU'), and the GPU run took 2-3 s per epoch whereas the CPU run took 11-14 s. Setting os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to turn off the GPU was the only code change between the two runs.
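If it helps, I can also rerun with device placement logging enabled; here is a rough sketch of how I would verify where ops actually execute (just my assumption of a reasonable check, not something from the tutorial):
import tensorflow as tf
# Log the device every op is placed on; must be called before any ops run.
tf.debugging.set_log_device_placement(True)
# A tiny matmul just to trigger some placement logs.
a = tf.random.normal((1024, 1024))
b = tf.random.normal((1024, 1024))
c = tf.matmul(a, b)
print(c.device)  # e.g. .../device:GPU:0 when the GPU is used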