I've done a lot of work with TF1 and recently upgraded to TF2, but I'm running into an issue where networks don't converge when training on a GPU, even though the same code converges on the CPU. Following the CNN tutorials on https://www.tensorflow.org/tutorials, I have noticed that the models fail to learn during training. Any ideas on what is causing this?
Another post suggested that this may be caused by floating point errors, but I have a hard time believing things are that unstable -- especially across multiple tutorials. I hit this problem on the following tutorials: Convolutional Neural Network (CNN), Transfer learning and fine-tuning, and Transfer learning with TF Hub.
I am running:
- TensorFlow version 2.3.0
- CUDA compilation tools, release 11.2, V11.2.125
- On an NVIDIA GeForce RTX 3090 GPU or an Intel i7-10700K CPU
- I had some trouble installing things initially, but the method described in this answer ended up working -- could that be the root issue?
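In case a version mismatch is the culprit, here is a small sketch of how I could print the CUDA/cuDNN versions my TensorFlow build was compiled against, so they can be compared with the installed toolkit (this assumes tf.sysconfig.get_build_info() is available in my TF version; it is not part of the tutorial code):
import tensorflow as tf
# Print the CUDA/cuDNN versions this TensorFlow build was compiled against.
# .get() is used so missing keys just print None instead of raising.
build_info = tf.sysconfig.get_build_info()
print('built with CUDA:', build_info.get('cuda_version'))
print('built with cuDNN:', build_info.get('cudnn_version'))
print('GPUs visible to TF:', tf.config.list_physical_devices('GPU'))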
To demonstrate, I copy/pasted the code from the CNN tutorial into the following script:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # minimize logs
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
RUN_ON_CPU = False
if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # hide the GPU so the run falls back to the CPU
print('gpu available', tf.config.list_physical_devices('GPU'))
# load dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# build model backbone
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# add dense layers on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()
# compile and train
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))
plt.figure()
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
# plt.ylim([0.5, 1])
plt.legend(loc='lower right')
if RUN_ON_CPU:
    plt.title('Training on CPU')
else:
    plt.title('Training on GPU')
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('test loss and accuracy', test_loss, test_acc)
plt.show()
This plots the following training curves and prints the following results, depending on the RUN_ON_CPU flag:
GPU test loss and accuracy 2.302645444869995 0.10000000149011612
CPU test loss and accuracy 0.879743754863739 0.7060999870300293
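As a quick sanity check on those numbers (my own arithmetic, not from the tutorial): the GPU loss of ~2.3026 is exactly what a 10-class model that guesses uniformly would produce, since -ln(1/10) ≈ 2.3026 and uniform guessing gives 10% accuracy:
import math
# Cross-entropy of a uniform prediction over 10 classes: -ln(1/10)
chance_loss = -math.log(1.0 / 10.0)
print(chance_loss)  # ~2.302585, essentially the GPU test loss above
print(1.0 / 10.0)   # 0.1, essentially the GPU test accuracy above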
The tutorial claims the CNN should reach a test accuracy of roughly 70%, which the GPU run doesn't come close to. To confirm the GPU was actually being used, I logged tf.config.list_physical_devices('GPU'), and the GPU run took 2-3 s per epoch whereas the CPU run took 11-14 s. Setting os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to turn off the GPU was the only code change between the two runs.
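If it helps, I can also rerun with device placement logging enabled; here is a rough sketch of how I would verify where ops actually execute (just my assumption of a reasonable check, not something from the tutorial):
import tensorflow as tf
# Log the device every op is placed on; must be called before any ops run.
tf.debugging.set_log_device_placement(True)
# A tiny matmul just to trigger some placement logs.
a = tf.random.normal((1024, 1024))
b = tf.random.normal((1024, 1024))
c = tf.matmul(a, b)
print(c.device)  # e.g. .../device:GPU:0 when the GPU is used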