I've been trying to train a CNN written with the TensorFlow implementation of Keras. Training appears to get stuck as soon as it reaches the first epoch, although nvidia-smi shows that my GPUs are still holding memory. No error messages or tracebacks are printed to the terminal either, which makes this tricky to debug. I've also written this code using TF estimators and datasets, and that version didn't train even when I left it running overnight. So I don't think this is just a case of needing to run the code for longer. It's probably something I've done, but it may also be due to an (allegedly fixed) bug, according to the second link below.
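Since the hang produces no traceback, one option is to have Python dump the stack of every thread when the script stalls. The following is a minimal sketch using the standard library's `faulthandler` module. `run_with_watchdog` and the 600-second timeout are hypothetical choices for illustration and are not part of the script below.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout=600, file=None):
    """Run fn(); while it is still running after `timeout` seconds, dump the
    Python traceback of every thread to `file` (default: stderr).

    `run_with_watchdog` is a hypothetical helper, not a TensorFlow/Keras API.
    """
    out = file if file is not None else sys.stderr
    # repeat=True dumps again every `timeout` seconds for as long as fn() runs.
    faulthandler.dump_traceback_later(timeout, repeat=True, file=out)
    try:
        return fn()
    finally:
        # Always disarm the watchdog, even if fn() raised.
        faulthandler.cancel_dump_traceback_later()


# Assumed usage with the training call from the script:
# run_with_watchdog(lambda: model.fit(data, labels, epochs=5, batch_size=32))
```

If the process is stuck inside native code, the dump only shows the Python line that called into it, but that should at least separate a stall in data loading from a stall inside `model.fit()`.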
At the moment, I'm also trying to track training progress using the "verbose" argument in model.fit() to see if anything is happening, but nothing appears in the terminal. Other people who hit this problem still seem to get a progress bar.
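With `verbose=2`, Keras reports metrics once per epoch rather than per batch, so it cannot show whether individual batches are completing. A per-batch log line with an explicit flush may help here. This is a sketch only: `make_batch_logger` is a hypothetical helper, and the commented wiring assumes the `callbacks` list from the script below.

```python
import sys
import time


def make_batch_logger(every=1, stream=None):
    """Return an on_batch_end(batch, logs) function that prints a flushed
    progress line every `every` batches.

    `make_batch_logger` is a hypothetical helper written for this sketch.
    """
    start = time.time()

    def on_batch_end(batch, logs=None):
        if batch % every != 0:
            return
        out = stream if stream is not None else sys.stdout
        loss = (logs or {}).get("loss")
        elapsed = time.time() - start
        # flush=True so the line shows up even if stdout is block-buffered,
        # e.g. when output is redirected to a file.
        print("batch {} loss={} elapsed={:.1f}s".format(batch, loss, elapsed),
              file=out, flush=True)

    return on_batch_end


# Assumed wiring into the callbacks list from the script:
# callbacks.append(
#     keras.callbacks.LambdaCallback(on_batch_end=make_batch_logger(every=10)))
```

If not even the line for batch 0 appears, the stall is before or inside the very first training step rather than at the end of the epoch.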
I've also tried logging with TensorBoard and saving model checkpoints. No checkpoints are being saved, and it looks like no TensorBoard graphs are being written either.
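By default, ModelCheckpoint writes at the end of an epoch, so no checkpoint is expected while the first epoch never finishes. Separately, saving may fail if a target directory does not exist, so it can be worth ruling that out by creating the output directories up front. A small sketch, where `ensure_output_dirs` is a hypothetical helper and the directory names are the ones the script below writes to:

```python
import os


def ensure_output_dirs(*paths):
    """Create each directory if it is missing and return the absolute paths.

    `ensure_output_dirs` is a hypothetical helper written for this sketch.
    """
    created = []
    for path in paths:
        # exist_ok=True makes this safe to call on every run.
        os.makedirs(path, exist_ok=True)
        created.append(os.path.abspath(path))
    return created


# Assumed usage before the callbacks are built:
# ensure_output_dirs("./Checkpoints", "./logs", "./Keras_Model", "./Tensorflow_Model")
```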
Any ideas on what might be causing this?
Can't get past first epoch -- just hangs [Keras Transfer Learning Inception]
Keras fit freezes at the end of the first epoch
import os
import tensorflow as tf
from tensorflow import keras
import cv2
import numpy as np
from tensorflow.python.framework.graph_util import convert_variables_to_constants
from tensorflow.python.keras import backend as K
cwd = os.getcwd()
log_dir = cwd + "/Keras_Model/"
callbacks = [keras.callbacks.ModelCheckpoint(filepath="./Checkpoints/weights.{epoch:02d}-{val_loss:.2f}.hdf5"),
             keras.callbacks.TensorBoard(log_dir="./logs")]
def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
    """
    TAKEN FROM HERE: https://stackoverflow.com/questions/45466020/how-to-export-keras-h5-to-tensorflow-pb
    Freezes the state of a session into a pruned computation graph. Used later to save model as TF pb file.
    Creates a new computation graph where variable nodes are replaced by
    constants taking their current value in the session. The new graph will be
    pruned so subgraphs that are not necessary to compute the requested
    outputs are removed.
    @param session The TensorFlow session to be frozen.
    @param keep_var_names A list of variable names that should not be frozen,
                          or None to freeze all the variables in the graph.
    @param output_names Names of the relevant graph outputs.
    @param clear_devices Remove the device directives from the graph for better portability.
    @return The frozen graph definition.
    """
    graph = session.graph
    with graph.as_default():
        freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
        output_names = output_names or []
        output_names += [v.op.name for v in tf.global_variables()]
        input_graph_def = graph.as_graph_def()
        if clear_devices:
            for node in input_graph_def.node:
                node.device = ""
        frozen_graph = convert_variables_to_constants(session, input_graph_def,
                                                      output_names, freeze_var_names)
        return frozen_graph
### IMPORT TRAINING IMAGES AS NUMPY ARRAY ###
t_dir = cwd + "/data-1/training/"
e_dir = cwd + "/data-1/evaluation"
xtrain = []
ytrain = []
print(" - Collating training data and labels... - ")
for subdir, dirs, files in os.walk(t_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xtrain.append(x)
        ytrain.append(y)
data = np.asarray(xtrain)
print(" - Training data collated. - ")
labels = np.asarray(ytrain)
print(" - Training labels collated. - ")
### IMPORT EVALUATION IMAGES AS NUMPY ARRAY ###
xeval = []
yeval = []
print(" - Collating validation data and labels... - ")
for subdir, dirs, files in os.walk(e_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xeval.append(x)
        yeval.append(y)
val_data = np.asarray(xeval)
print(" - Validation data collated. - ")
val_labels = np.asarray(yeval)
print(" - Validation labels collated. - ")
### CREATE MODEL ###
model = keras.Sequential()
model.add(keras.layers.Conv2D(filters=32, kernel_size=5, strides=1, padding="same", data_format = "channels_last", activation="relu", input_shape= (480,640,3)))
model.add(keras.layers.GlobalMaxPool2D(data_format = "channels_last"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dropout(0.4)) # --> Change dropout rate here.
model.add(keras.layers.Dense(8, activation="softmax"))
model.compile(optimizer=tf.train.AdamOptimizer(0.001), # --> Choose learning rate here.
              loss=keras.losses.sparse_categorical_crossentropy,
              metrics=[keras.metrics.sparse_categorical_accuracy]) # --> Sparse metric to match the integer labels.
print(" - Model created... - ")
print(" - Model Summary - ")
model.summary() # --> Print model summary.
### TRAIN AND EVALUATE MODEL ###
print(" - Training model... - ")
model.fit(data, labels, epochs = 5, batch_size=32, callbacks=callbacks, validation_data=(val_data, val_labels), verbose = 2)
print(" - Model trained! - ")
### SAVE MODEL AS H5 AND PB FILES ###
model.save("./Keras_Model/model.h5", save_format="h5")
print(" - Saved model as h5. - ")
frozen_graph = freeze_session(K.get_session(), output_names=[out.op.name for out in model.outputs])
tf.train.write_graph(frozen_graph, "./Tensorflow_Model/", "model.pb", as_text=False)
print(" - Saved model as pb. - ")
print(" - Clearing session. - ")
K.clear_session()
I can also provide the version where I use TF datasets and estimators, or anything else that would help. Apologies if I've left anything obvious out; I've just started using SO.
UPDATE: I went home last night and ran this script on my own computer, where it seems to work. So this is evidently not a usage issue, but probably a problem either with TF itself or with the way it's been configured on our server. It's a bit bizarre because TF was working at some point previously, but what can you do. Cheers all.
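Given that the same script behaves differently on two machines, comparing the environments side by side may narrow things down. The sketch below collects a few details using only the standard library. `environment_report` is a hypothetical helper, and the package names and environment variables it checks are assumptions about what might differ, not a confirmed cause.

```python
import os
import platform

try:
    from importlib import metadata  # available from Python 3.8
except ImportError:
    metadata = None  # on older Pythons the version lookup is skipped


def package_version(name):
    """Return the installed version of `name`, or None if it is not found."""
    if metadata is None:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("tensorflow", "tensorflow-gpu", "numpy",
                                 "opencv-python", "h5py")):
    """Collect details that could differ between two machines.

    Hypothetical helper; the default package list is an assumption.
    """
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        # Variables that commonly affect which GPUs and libraries are picked up.
        "env": {var: os.environ.get(var)
                for var in ("CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH")},
    }


# Assumed usage: run on both machines and compare the printed dictionaries.
# print(environment_report())
```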