The exploding gradients and exploding loss only happen when the network is large, so I won't post the entire network here. But over the past two weeks I've dug into the details of the Keras source code to monitor individual weights, and I hand-coded the update step so I could compare the loss, weights, updates, gradients, and hyperparameters against Keras's internal state. I think I've done my homework before asking here.
The question: there are two ways to train using the Keras API. The first is model.fit(); the second is a more customized loop for more complex training and networks. I kept nearly everything the same between them, yet model.fit() does not produce an exploding loss while the custom method does. Funny enough, when I monitor the details on a much smaller network, everything looks identical between the two methods.
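Concretely, the update step I hand-coded to cross-check against Keras is the standard Nesterov-momentum SGD rule, roughly like this in plain numpy (simplified sketch; the names are just for illustration):

import numpy as np

lr, momentum = 0.1, 0.9

def nesterov_sgd_step(w, g, v):
    # w: weight tensor, g: its gradient, v: the velocity (momentum buffer)
    v_new = momentum * v - lr * g
    w_new = w + momentum * v_new - lr * g   # Nesterov look-ahead form, which as far as I can tell is what Keras SGD uses
    return w_new, v_new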
Environment:
# tensorflow 1.14
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
For the model.fit() method:
# I've skipped the details of the next two lines since I can't share them, but x_train is
# [10000, 32, 32, 3] image data, y_train is [10000, 10, 1] labels, and model is a regular Keras model.
x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()
loss_fn = tf.keras.losses.CategoricalCrossentropy()
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)
model.compile(loss=loss_fn, optimizer=sgd, metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=128, epochs=100, validation_data=(x_test, y_test))
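For comparison on the fit() side, the per-batch loss can be logged with a small callback like this (simplified sketch):

class BatchLossLogger(tf.keras.callbacks.Callback):
    def __init__(self):
        super(BatchLossLogger, self).__init__()
        self.batch_losses = []

    def on_batch_end(self, batch, logs=None):
        # 'loss' is the loss fit() reports for the batch that just finished
        self.batch_losses.append(logs.get('loss'))

# passed in as: model.fit(..., callbacks=[BatchLossLogger()])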
Custom Method:
x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()
input = model.inputs[0]
y_true = tf.placeholder(dtype = tf.int32, shape = [None, 10])
y_pred = model.outputs[0]
loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)
weights = model.trainable_weights
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)
training_updates = sgd.get_updates(loss, weights)   # optimizer update ops for the trainable weights
training_fn = K.function([y_true, input], [loss], training_updates)   # one call = one training step
num_train = 10000
steps_per_epoch = int(num_train / 128) # batch size 128
total_steps = steps_per_epoch * 100 # epoch 100
for step in range(total_steps):
    idx = np.random.randint(0, 10000, 128)   # sample a batch of 128 training indices
    input_img = x_train[idx]
    ground_true = y_train[idx]
    cur_loss = training_fn([ground_true, input_img])
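For the gradient/weight monitoring mentioned at the top, this is the kind of thing I did on the custom side (simplified sketch; it reuses model, loss, weights, y_true and input from the code above):

grads = K.gradients(loss, weights)                     # symbolic gradients w.r.t. each trainable weight
monitor_fn = K.function([y_true, input], [loss] + grads)

cur = monitor_fn([ground_true, input_img])             # evaluate on the same batch as the training step
cur_loss_val, cur_grads = cur[0], cur[1:]
grad_norms = [np.sqrt(np.sum(g ** 2)) for g in cur_grads]   # per-tensor gradient norms
weight_snapshot = [K.get_value(w) for w in weights]         # numpy copies of the current weights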
So in short: same model, same loss function, same SGD optimizer, same image feed (I do control the image feed order in my real runs, although the code here picks random samples from the training data). Is there anything in the internal processing of model.fit() that would prevent the loss or gradients from exploding?