
I am using Keras 2.0.4 (TensorFlow backend) for an image classification task (based on pretrained models). During training/tuning I track all metrics in use (e.g. categorical_accuracy, categorical_crossentropy) with CSVLogger, including the corresponding metrics on the validation set (i.e. val_categorical_accuracy, val_categorical_crossentropy).

With the ModelCheckpoint callback I save the best configuration of weights (save_best_only=True). To evaluate the model on the validation set I use model.evaluate().

My expectation is that the metrics tracked by CSVLogger (for the 'best' epoch) equal the metrics calculated by model.evaluate(). Unfortunately this is NOT the case: the metrics differ by ±5%. Is there a reason for this behavior?


EDIT:

After some testing I could gain some insights:

  1. If I don't use a generator for the training and validation data (and therefore no model.fit_generator()), the problem doesn't occur. --> Using ImageDataGenerator for the training and validation data is the source of the discrepancy. (Please note: for model.evaluate() I don't use a generator, but I do use the same validation data - at least if ImageDataGenerator worked as expected...) A sketch of how to check this follows the list.
    I think the ImageDataGenerator doesn't work as it should (please also have a look at this).
  2. If I use no generators at all, the problem does not occur, i.e. the metrics tracked by CSVLogger (for the 'best' epoch) equal the metrics calculated by model.evaluate().
    Interestingly, there is another problem: if you use the same data for training and validation, there will be a discrepancy between training metrics (e.g. loss) and validation metrics (e.g. val_loss) at the end of each epoch.
    (A similar problem)
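
One way to check whether the validation generator really yields the same (preprocessed) validation data is sketched below; it uses the names from the code listing that follows (val_datagen, validation_data_dir, validation_steps, X_val, Y_val, model) and assumes the model has already been trained:

# build a deterministic validation generator (no shuffling)
check_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle=False,
    class_mode='categorical')

# draw exactly one pass over the validation set from the generator
X_gen, Y_gen = [], []
for _ in range(validation_steps):
    x_batch, y_batch = next(check_generator)
    X_gen.append(x_batch)
    Y_gen.append(y_batch)
X_gen = np.concatenate(X_gen)
Y_gen = np.concatenate(Y_gen)

# if the generator behaves as expected, both evaluations should match closely
print(model.evaluate(X_val, Y_val, batch_size=batch_size, verbose=0))
print(model.evaluate(X_gen, Y_gen, batch_size=batch_size, verbose=0))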

Used Code:

############################ import section ############################
from __future__ import print_function # perform like in python 3.x
from keras.datasets import mnist
from keras.utils import np_utils # numpy utils for to_categorical()
from keras.models import Model, load_model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, GaussianDropout, Conv2D, MaxPooling2D
from keras.optimizers import SGD, Adam
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator 
from keras import metrics
import os
import sys
from scipy import misc
import numpy as np
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input
from keras.applications import VGG16
from keras.callbacks import CSVLogger, ModelCheckpoint


############################ manual settings ###########################
# general settings
seed = 1337

loss_function = 'categorical_crossentropy'

learning_rate = 0.001

epochs = 10

batch_size = 20

nb_classes = 5 

img_width, img_height = 400, 400 # >= 48 necessary, as VGG16 is used

chosen_optimizer = SGD(lr=learning_rate, momentum=0.0, decay=0.0, nesterov=False)

steps_per_epoch = 40 // batch_size  # 40 train samples in 5 classes
validation_steps = 40 // batch_size # 40 validation samples in 5 classes

data_dir = # TODO: set path where data is stored (folders: 'train', 'val', 'test'; within each folder are folders named by classes)

# callbacks: CSVLogger & ModelCheckpoint
filepath = # TODO: set path, where you want to store files generated by the callbacks
file_best_checkpoint= 'best_epoch.hdf5'
file_csvlogger = 'logged_metrics.txt'

modelcheckpoint_best_epoch= ModelCheckpoint(filepath=os.path.join(filepath, file_best_checkpoint), 
                                  monitor = 'val_loss' , verbose = 1, 
                                  save_best_only = True, 
                                  save_weights_only=False, mode='auto', 
                                  period=1) # every epoch executed
csvlogger = CSVLogger(os.path.join(filepath, file_csvlogger) , separator=',', append=False)



############################ prepare data ##############################
# get validation data (for evaluation)
X_val, Y_val = # TODO: load validation data (4D array: samples, img_width, img_height, nb_channels) IMPORTANT: 5 classes with 8 images each.

# preprocess data (originally via a custom wrapper mf.my_vgg16_preprocess_input; the imported VGG16 preprocess_input is used here so the listing is self-contained)
my_preprocessing_function = vgg16_preprocess_input

# 'augmentation' configuration we will use for training
train_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

# 'augmentation' configuration we will use for validation
val_datagen = ImageDataGenerator(preprocessing_function = my_preprocessing_function) # only preprocessing; static data set

train_data_dir = os.path.join(data_dir, 'train')
validation_data_dir = os.path.join(data_dir, 'val')
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)

validation_generator = val_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    shuffle = True,
    seed = seed, # random seed for shuffling and transformations
    class_mode='categorical')  # label type (categorical = one-hot vector)



############################## training ###############################
print("\n---------------------------------------------------------------")
print("------------------------ training model -----------------------")
print("---------------------------------------------------------------")
# create the base pre-trained model
base_model = VGG16(include_top=False, weights = None, input_shape=(img_width, img_height, 3), pooling = 'max', classes = nb_classes)
model_name =  "VGG_modified"

# do not freeze any layers --> all layers trainable
for layer in base_model.layers:
    layer.trainable = True

# define topping of base_model
x = base_model.output # get the last layer of our base_model
x = Dense(1024, activation='relu', name='fc1')(x)
x = Dense(1024, activation='relu', name='fc2')(x)
predictions = Dense(nb_classes, activation='softmax', name='predictions')(x)

# finally, stack model together
model = Model(outputs=predictions, name= model_name, inputs=base_model.input) #Keras 1.x.x: model = Model(input=base_model.input, output=predictions) 
print(model.summary())

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer = chosen_optimizer, loss=loss_function, 
            metrics=['categorical_accuracy','kullback_leibler_divergence'])

# train the model on your data
model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks = [csvlogger, modelcheckpoint_best_epoch])



############################## evaluation ##############################
print("\n\n---------------------------------------------------------------")
print("------------------ Evaluation of Best Epoch -------------------")
print("---------------------------------------------------------------")
# load model (corresponding to best training epoch)
model = load_model(os.path.join(filepath, file_best_checkpoint))

# evaluate model on validation data (in test mode!)
list_of_metrics = model.evaluate(X_val, Y_val, batch_size=batch_size, verbose=1, sample_weight=None)
print('\nMetrics:')
for metric_name, metric_value in zip(model.metrics_names, list_of_metrics):
    print(metric_name + ':', str(metric_value))

EDIT 2:
Referring to point 1 of the EDIT above: if I use the same generator for the validation data during training and evaluation (by calling evaluate_generator()), the problem still occurs. Hence, it is definitely a problem caused by the generators...
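
For reference, a minimal sketch of this check, using the validation_generator, validation_steps and checkpoint path from the code listing above:

model = load_model(os.path.join(filepath, file_best_checkpoint))
gen_metrics = model.evaluate_generator(validation_generator, steps=validation_steps)
for metric_name, metric_value in zip(model.metrics_names, gen_metrics):
    print(metric_name + ':', str(metric_value))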


1 Answer


That will only be the case for the evaluation of the metrics on the validation dataset.

The metrics computed on the training dataset during training do not reflect the real metrics of the model at the end of the epoch, because the model is updated (modified) after every single batch.
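
For example, one way to see the 'real' end-of-epoch training metrics is to re-evaluate the fixed end-of-epoch weights on the training data. A minimal sketch of such a callback (assuming the training data is available as in-memory arrays; X_train and Y_train are hypothetical names here):

from keras.callbacks import Callback

class EndOfEpochEvaluation(Callback):
    def __init__(self, X_train, Y_train, batch_size):
        super(EndOfEpochEvaluation, self).__init__()
        self.X_train = X_train
        self.Y_train = Y_train
        self.batch_size = batch_size

    def on_epoch_end(self, epoch, logs=None):
        # evaluates the frozen end-of-epoch weights, unlike the running
        # averages Keras reports while the weights still change per batch
        results = self.model.evaluate(self.X_train, self.Y_train,
                                      batch_size=self.batch_size, verbose=0)
        print('\nEnd-of-epoch training metrics:',
              dict(zip(self.model.metrics_names, results)))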

Does this help?

  • CSVLogger tracks metrics on the validation set after every epoch. Let's assume the last epoch results in the best configuration of weights. Then the last tracked metrics on the validation set are exactly the metrics obtained when evaluating on the validation set. What am I missing? – D.Laupheimer May 12 '17 at 12:55
  • hm what is the metric used to save the best only ? – Nassim Ben May 12 '17 at 13:04
  • monitored quantity is validation loss (`val_categorical_crossentropy`) – D.Laupheimer May 12 '17 at 13:07
  • Indeed, it shouldn't matter... sorry, I'm stuck on this case as well. Ideally you should post some code so that we could reproduce your problem and help troubleshoot it :-) – Nassim Ben May 12 '17 at 14:22
  • I tracked the problem. See Edits above – D.Laupheimer May 16 '17 at 13:08