
I've done a lot of work with TF1 and recently upgraded to TF2, but I'm running into issues running TF2 on a GPU: the network isn't converging, even though the same code converges when running on a CPU. Following the CNN tutorials on https://www.tensorflow.org/tutorials, I have noticed that the models fail to learn during training. Any ideas on what is causing this?

Another post suggested that this may be caused by floating point errors, but I have a hard time believing things are that unstable -- especially across multiple tutorials. I ran into this problem on the following tutorials: Convolutional Neural Network (CNN), Transfer learning and fine-tuning, and Transfer learning with TF Hub.

I am running:

  • TensorFlow version 2.3.0
  • CUDA compilation tools release 11.2, V11.2.125
  • On an NVIDIA GeForce RTX 3090 or an Intel i7-10700K CPU
  • I had some trouble installing things initially, but the method described in this answer ended up working -- could that be the root issue? (A quick version check is sketched below.)
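
For reference, here is a quick way to check which CUDA/cuDNN versions the installed wheel was built against (a hedged sketch -- tf.sysconfig.get_build_info() only exists in recent TF 2.x releases, so it may not be available on every install):

import tensorflow as tf

# Print the runtime version and the CUDA/cuDNN versions this build was compiled against.
print('TF version:', tf.__version__)
print(dict(tf.sysconfig.get_build_info()))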

To demonstrate, I copy/pasted the code from the CNN tutorial into the following script:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # minimize logs

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

RUN_ON_CPU = False

if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide the GPU so training falls back to the CPU

print('gpu available', tf.config.list_physical_devices('GPU'))

# load dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

# build model backbone
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# add dense layers on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()

# compile and train
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

plt.figure()
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
# plt.ylim([0.5, 1])
plt.legend(loc='lower right')
if RUN_ON_CPU:
    plt.title('Training on CPU')
else:
    plt.title('Training on GPU')
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
print('test loss and accuracy', test_loss, test_acc)

Which plots the following training curves depending on the RUN_ON_CPU flag:

[training-accuracy curve of the GPU run]

GPU test loss and accuracy 2.302645444869995 0.10000000149011612

CPU test loss and accuracy 0.879743754863739 0.7060999870300293

The tutorial claims that the CNN should reach a test accuracy of ~70%, which the GPU run doesn't come close to; 10% accuracy on CIFAR-10 is just random guessing over the 10 classes. To be sure the GPU was actually being used, I logged tf.config.list_physical_devices('GPU'), and the GPU run took 2-3 s per epoch whereas the CPU took 11-14 s. Setting os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to turn off the GPU was the only code change between the runs.
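
To double-check which device the ops actually land on, TensorFlow's device-placement logging can be turned on before building the model (standard tf.debugging/tf.config calls, shown here as a minimal sketch rather than part of the tutorial code):

import tensorflow as tf

# Log the device every op is placed on; entries should mention GPU:0 when the GPU is in use.
tf.debugging.set_log_device_placement(True)
print(tf.config.list_physical_devices('GPU'))

# A tiny matmul forces at least one op to be placed (and therefore logged).
a = tf.random.normal((256, 256))
b = tf.random.normal((256, 256))
print(float(tf.reduce_sum(tf.matmul(a, b))))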

  • Not reproducible with an RTX 3070 and CUDA 11.3. I get the same timing as you, GPU around 6 times faster, but my results are `GPU test loss and accuracy 0.8902135491371155 0.7114999890327454` and `CPU test loss and accuracy 0.897831916809082 0.7027000188827515`. Your install is probably broken. – phe Jul 22 '21 at 18:23
  • Thanks for the update @phe! Any idea on how to fix the install? – Brett S Jul 22 '21 at 19:59
  • I'm not familiar with conda, but the first thing to do is probably to check whether the [cuda examples code](https://github.com/NVIDIA/cuda-samples) works after you activate your conda tf 2.3 env. – phe Jul 22 '21 at 20:11
  • Never mind my last comment: TensorFlow added GPU support for the Ampere architecture in TF 2.4, so if you want to use the tensor core hardware you need the latest release. – phe Jul 22 '21 at 20:26
  • Ok. I'm going to uninstall CUDA 11.2 from the Control Panel, delete `C:\Program Files\NVIDIA GPU Computing Toolkit`, and then reinstall by following this tutorial https://www.youtube.com/watch?v=hHWkvEcDBO0 -- will update – Brett S Jul 22 '21 at 20:56
  • It worked! I will add an answer with a more in-depth walkthrough of what I did – Brett S Jul 22 '21 at 21:22

1 Answer


Ok, I got it working, thanks to @phe, who suggested in a comment that my installation was faulty.

Here's what I did:

Uninstall CUDA

  1. In the Control Panel, I uninstalled everything related to CUDA 11.2
  2. I deleted the folder at C:\Program Files\NVIDIA GPU Computing Toolkit because the uninstallers didn't remove everything

Reinstall CUDA and cuDNN (following this video)

  1. Using this table, I determined which versions of CUDA and cuDNN to install: https://www.tensorflow.org/install/source#gpu
  2. Download the appropriate CUDA and cuDNN versions
  3. Install CUDA
  4. Extract the downloaded cuDNN zip and copy its bin, include, and lib subfolders into the CUDA install directory (for me this was C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2); this merges the cuDNN files into CUDA's matching folders
  5. Add CUDA's bin and libnvvp folders to the PATH environment variable (for me, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin and C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp) -- a quick way to verify the install is sketched below
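
The install can then be sanity-checked from Python before touching TensorFlow (a rough sketch -- it assumes nvcc ended up on PATH and uses my install location, so adjust the v11.2 folder for your setup):

import subprocess
from pathlib import Path

# nvcc ships with the CUDA toolkit; if PATH is set up, this prints the toolkit version.
print(subprocess.run(['nvcc', '--version'], capture_output=True, text=True).stdout)

# The cuDNN files copied in step 4 should now live inside the CUDA folders.
cuda_root = Path(r'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2')
print('cudnn.h present:', (cuda_root / 'include' / 'cudnn.h').exists())
print('cudnn DLLs:', [p.name for p in (cuda_root / 'bin').glob('cudnn*.dll')])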

Create a Python environment with TensorFlow

  1. Install Anaconda
  2. Create an environment with an appropriate Python version (see the table from step 1 of the CUDA install): conda create -n tf25 python=3.8
  3. Activate the environment: conda activate tf25
  4. Install TensorFlow using pip, not conda (I think installing through conda is where my system got messed up): pip install tensorflow (specify a version if you don't want the most recent one)
  5. Run your code in that environment (a quick GPU sanity check is sketched after this list)
  6. Invent AGI (or don't if you want to prevent the apocalypse :)
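
Once the new environment is active, this quick check (just the standard tf.config/tf.test calls) confirms that the wheel was built with CUDA and that the GPU is visible and usable:

import tensorflow as tf

print('TF version:', tf.__version__)
print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPUs visible:', tf.config.list_physical_devices('GPU'))

# Run one small op explicitly on the GPU as a smoke test; this raises if no GPU is usable.
with tf.device('/GPU:0'):
    x = tf.random.normal((1024, 1024))
    print('matmul OK, sum =', float(tf.reduce_sum(tf.matmul(x, x))))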