tf.keras - Training on first epoch not progressing despite using GPU memory

Question

I've been trying to train a CNN written with the TensorFlow implementation of Keras (tf.keras). Training appears to get stuck as soon as it reaches the first epoch, although nvidia-smi shows that my GPUs are still holding memory. No error messages or tracebacks are printed to the terminal, which makes this tricky to debug. I've also written this code using TF estimators and datasets, and that version hadn't trained either after I left it running overnight. So I don't think this is just a case of leaving the code to run for longer. It's probably something I've done, but it may also be due to an (allegedly fixed) bug described in the second link below.

At the moment, I'm also trying to track training progress using the "verbose" argument in model.fit() to see if anything is happening. Nothing appears in the terminal, though. Other people who hit this problem still seem to get a progress bar.
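
When a run hangs without printing anything, one way to see where it is stuck is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. The following is a minimal sketch and not part of the original script: `run_with_hang_watchdog`, `log_path` and `timeout_s` are hypothetical names, and the wrapped callable would be whichever step hangs, for example the `model.fit(...)` call.

```python
import faulthandler


def run_with_hang_watchdog(fn, log_path, timeout_s=300.0):
    """Call fn(); while it runs, dump all thread stacks to log_path every timeout_s seconds.

    Hypothetical helper for illustration. The dump shows which Python line
    each thread is blocked on, which helps when nothing is printed.
    """
    with open(log_path, "w") as log:
        # repeat=True keeps dumping every timeout_s seconds until cancelled.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            return fn()
        finally:
            # Always cancel so the watchdog never fires after fn() has returned.
            faulthandler.cancel_dump_traceback_later()
```

For example, `run_with_hang_watchdog(lambda: model.fit(data, labels, epochs=5), "hang_trace.log")` would leave a stack dump in `hang_trace.log` if the call is still running after five minutes.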

I've also tried logging with TensorBoard and saving model checkpoints. No checkpoints are being saved, and it looks like no TensorBoard graphs are being saved either.
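
By default `ModelCheckpoint` writes at the end of an epoch, so missing checkpoint files are consistent with a run that never finishes its first epoch. It may still be worth ruling out the trivial case where the output directories do not exist or are not writable on the server. A minimal sketch, assuming the `./Checkpoints` and `./logs` paths from the script; `ensure_writable_dir` is a hypothetical helper:

```python
import os


def ensure_writable_dir(path):
    """Create path if needed and report whether this process can write into it."""
    os.makedirs(path, exist_ok=True)
    # W_OK: may create files here; X_OK: may enter the directory.
    return os.access(path, os.W_OK | os.X_OK)


if __name__ == "__main__":
    for directory in ("./Checkpoints", "./logs"):
        print(directory, "writable:", ensure_writable_dir(directory))
```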

Any ideas on what might be causing this?

Can't get past first epoch -- just hangs [Keras Transfer Learning Inception]

Keras fit freezes at the end of the first epoch

import os
import tensorflow as tf
from tensorflow import keras
import cv2
import numpy as np
from tensorflow.python.framework.graph_util import convert_variables_to_constants
from tensorflow.python.keras import backend as K

cwd = os.getcwd()
log_dir = cwd + "/Keras_Model/"
callbacks = [keras.callbacks.ModelCheckpoint(filepath="./Checkpoints/weights.{epoch:02d}-{val_loss:.2f}.hdf5"),
         keras.callbacks.TensorBoard(log_dir="./logs")]

def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
    """
    TAKEN FROM HERE: https://stackoverflow.com/questions/45466020/how-to-export-keras-h5-to-tensorflow-pb
    Freezes the state of a session into a pruned computation graph. Used later to save model as TF pb file.

    Creates a new computation graph where variable nodes are replaced by
    constants taking their current value in the session. The new graph will be
    pruned so subgraphs that are not necessary to compute the requested
    outputs are removed.

    @param session The TensorFlow session to be frozen.
    @param keep_var_names A list of variable names that should not be frozen,
                          or None to freeze all the variables in the graph.
    @param output_names Names of the relevant graph outputs.
    @param clear_devices Remove the device directives from the graph for better portability.
    @return The frozen graph definition.
    """
    graph = session.graph
    with graph.as_default():
        freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
        output_names = output_names or []
        output_names += [v.op.name for v in tf.global_variables()]
        input_graph_def = graph.as_graph_def()
        if clear_devices:
            for node in input_graph_def.node:
                node.device = ""
        frozen_graph = convert_variables_to_constants(session, input_graph_def,
                                                      output_names, freeze_var_names)
        return frozen_graph

### IMPORT TRAINING IMAGES AS NUMPY ARRAY ###

t_dir = cwd + "/data-1/training/" 
e_dir = cwd + "/data-1/evaluation"

xtrain = []
ytrain = []

print(" - Collating training data and labels... - ")

for subdir, dirs, files in os.walk(t_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xtrain.append(x)
        ytrain.append(y)

data = np.asarray(xtrain)
print(" - Training data collated. - ")
labels = np.asarray(ytrain)
print(" - Training labels collated. - ")


### IMPORT EVALUATION IMAGES AS NUMPY ARRAY ###

xeval = []
yeval = []

print(" - Collating validation data and labels... - ")

for subdir, dirs, files in os.walk(e_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xeval.append(x)
        yeval.append(y)

val_data = np.asarray(xeval)
print(" - Validation data collated. - ")
val_labels = np.asarray(yeval)
print(" - Validation labels collated. - ")

### CREATE MODEL ###

model = keras.Sequential()

model.add(keras.layers.Conv2D(filters=32, kernel_size=5, strides=1, padding="same", data_format="channels_last", activation="relu", input_shape=(480, 640, 3)))

model.add(keras.layers.GlobalMaxPool2D(data_format="channels_last"))

model.add(keras.layers.Dense(64, activation="relu"))

model.add(keras.layers.Dropout(0.4)) # --> Change dropout rate here.

model.add(keras.layers.Dense(8, activation="softmax"))

model.compile(optimizer=tf.train.AdamOptimizer(0.001), # --> Choose learning rate here.
              loss=keras.losses.sparse_categorical_crossentropy,
              metrics=[keras.metrics.sparse_categorical_accuracy]) # --> Labels are integers, so use the sparse metric.

print(" - Model created... - ")
print(" - Model Summary - ")
model.summary() # --> Print model summary.

### TRAIN AND EVALUATE MODEL ###

print(" - Training model... - ")
model.fit(data, labels, epochs = 5, batch_size=32, callbacks=callbacks, validation_data=(val_data, val_labels), verbose = 2)
print(" - Model trained! - ")

### SAVE MODEL AS H5 AND PB FILES ###

model.save("./Keras_Model/model.h5", save_format="h5")
print(" - Saved model as h5. - ")

frozen_graph = freeze_session(K.get_session(), output_names=[out.op.name for out in model.outputs])
tf.train.write_graph(frozen_graph, "./Tensorflow_Model/", "model.pb", as_text=False)
print(" - Saved model as pb. - ")

print(" - Clearing session. - ")
K.clear_session() # --> clear_session lives in the Keras backend module, not in keras itself.

I can also provide the version where I use TF datasets and estimators, or anything else that would help. Apologies if I've left anything obvious out; I've just started using SO.
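
One thing that can be checked in the loading loops above: `cv2.imread` returns `None` instead of raising when a file cannot be read, and `os.walk` will also pick up any non-image files in the data folders. In either case the collated array would not have the (N, 480, 640, 3) shape the model expects. A minimal sketch of a check; `find_bad_images` is a hypothetical helper, and it works with anything that exposes `.shape`, such as the arrays returned by `cv2.imread`:

```python
def find_bad_images(images, expected_shape=(480, 640, 3)):
    """Return indices of entries that are None or whose .shape differs from expected_shape."""
    expected = tuple(expected_shape)
    bad = []
    for index, img in enumerate(images):
        # getattr with a default covers objects that have no .shape at all.
        if img is None or tuple(getattr(img, "shape", ())) != expected:
            bad.append(index)
    return bad
```

Calling `find_bad_images(xtrain)` before `np.asarray(xtrain)` would list the positions of any unreadable or wrongly sized images.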

UPDATE: I went home last night and ran this script on my computer - it seems to work so clearly this is not a usage issue, but probably either a problem with TF itself or the way it's been configured on our server. It's a bit bizarre because TF was working at some point previously, but what can you do. Cheers all.
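
Since the same script behaves differently on the two machines, comparing their environments side by side may narrow things down. The following is a minimal sketch using only the standard library; `environment_report` and the list of package names are assumptions for illustration, and on Python versions older than 3.8 (where `importlib.metadata` is unavailable) the versions are simply reported as `None`.

```python
import os
import platform
import sys

try:
    from importlib import metadata  # available from Python 3.8
except ImportError:
    metadata = None


def environment_report(packages=("tensorflow", "tensorflow-gpu", "numpy", "h5py")):
    """Collect interpreter, platform, GPU-related env vars and installed package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        "packages": {},
    }
    for name in packages:
        version = None
        if metadata is not None:
            try:
                version = metadata.version(name)
            except metadata.PackageNotFoundError:
                version = None  # package is not installed in this environment
        report["packages"][name] = version
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(key, "=", value)
```

Running this on both the server and the home machine and diffing the output would show whether the TensorFlow build or the CUDA-related variables differ.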

Does a `tf` optimizer really work with keras?, try `optimizer='adam'`. Have you tried without the callbacks, just in case? — Daniel Möller, Aug 02 '18 at 14:10
I've tried "verbose=1" and "verbose=2" - neither of which output anything. — tm2313, Aug 02 '18 at 14:15
According to TF, optimizers should work. https://www.tensorflow.org/api_docs/python/tf/keras/optimizers. I'm essentially using the example that they've given on the website. As I'm fast learning, a lot of these are out of date, so you might be correct. Will give this a try in a second. This is the page I'm using as a reference: https://www.tensorflow.org/guide/keras. — tm2313, Aug 02 '18 at 14:16
I'd say `model.compile(optimizer='adam', loss='categorical_crossentropy')` would be the best to eliminate suspicions about compile. — Daniel Möller, Aug 02 '18 at 14:25
Okay, have just tried optimizer='adam' - which didn't seem to work. Have also checked with callbacks. — tm2313, Aug 02 '18 at 14:26
Actually was about to write this, but yes, this is a new development. I've tried using some scripts from a couple of months ago (which worked then) but those are doing the same thing as of 10 minutes ago. We had a server problem a few days ago and I'm talking to our system admin to see what the problem might be. — tm2313, Aug 02 '18 at 14:29
Maybe you should think of downgrading tensorflow it might be a bug.@tm2313 — Mohd Shibli, Sep 23 '18 at 07:35
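
In the spirit of the suggestions above to simplify `compile` and drop the callbacks, a tiny model fitted on random data can help separate a problem in the script from a problem in the installation. This is a minimal sketch under stated assumptions, not the asker's model: `keras_smoke_test` is a hypothetical name, the layer sizes are arbitrary, and the sparse loss is used because the labels in the question are integers. If even this hangs on the server, the script itself is unlikely to be the cause.

```python
try:
    import tensorflow as tf
except ImportError:  # lets the sketch load on machines without TensorFlow
    tf = None


def keras_smoke_test(epochs=1):
    """Fit a tiny dense model on random data.

    Returns the list of per-epoch losses, or None if TensorFlow is not installed.
    """
    if tf is None:
        return None
    import numpy as np

    x = np.random.rand(64, 16).astype("float32")
    y = np.random.randint(0, 8, size=(64,))  # integer labels, 8 classes as in the question

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(8, activation="softmax"),
    ])
    # String optimizer and loss names, no callbacks, no validation data.
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    history = model.fit(x, y, epochs=epochs, batch_size=16, verbose=2)
    return history.history["loss"]


if __name__ == "__main__":
    print("losses:", keras_smoke_test())
```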

0 Answers
