
I am running a TensorFlow U-Net model without dropout (but with batch normalization) and with a custom metric called "average accuracy". This is literally the relevant section of code; as you can see, the datasets must be identical because I do nothing between fit and evaluate.

model.fit(x=train_ds, epochs=epochs, validation_data=val_ds, shuffle=True, 
callbacks=callbacks)
model.evaluate(train_ds)
model.evaluate(val_ds)

train_ds and val_ds are tf.data.Dataset objects. Here is the output.

...
Epoch 10/10
148/148 [==============================] - 103s 698ms/step - loss: 0.1765 - accuracy: 0.5872 - average_accuracy: 0.9620 - val_loss: 0.5845 - val_accuracy: 0.5788 - val_average_accuracy: 0.5432
148/148 [==============================] - 22s 118ms/step - loss: 0.5056 - accuracy: 0.4540 - average_accuracy: 0.3654
29/29 [==============================] - 5s 122ms/step - loss: 0.5845 - accuracy: 0.5788 - average_accuracy: 0.5432

There is an unbelievable difference between the average_accuracy during training (fit) and the average_accuracy from evaluate (both on the training dataset). I know that BN can have this effect, and also that performance changes during training, so they will never be exactly equal. But from 96% to 36%?
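
To see how much of the gap BN alone could explain, a quick check would be to score the same fixed batch with BN in training mode and in inference mode. This is only a sketch, assuming the model, train_ds and custom_average_accuracy from above; the comparison itself is an extra experiment, not part of my original code:

import tensorflow as tf

# Take one fixed batch so both calls see exactly the same data.
x_batch, y_batch = next(iter(train_ds))

pred_train_mode = model(x_batch, training=True)    # BN uses batch statistics
pred_infer_mode = model(x_batch, training=False)   # BN uses moving averages

print(custom_average_accuracy(y_batch, pred_train_mode).numpy())
print(custom_average_accuracy(y_batch, pred_infer_mode).numpy())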

My custom accuracy is defined below, but I doubt the problem is my implementation, since the values should be at least somewhat close no matter what I did (I think).

Any hint here is useful. I don't know whether I should review the custom metric, the dataset, or the model. It seems to be outside all of them.


I tried to continue training after stopping, and average_accuracy starts from where it left off, at more than 90%.


Context for the custom metric: I use it for semantic segmentation, so for each image the label is itself an image of shape W x H x 4 (4 being my total number of classes).

It computes the average accuracy, that is, the accuracy of each class separately; then, if there are 4 classes, it returns sum(accuracies per class) / 4.
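
As a toy illustration of the averaging only (these numbers are made up, not from my data):

per_class_acc = [0.95, 0.80, 0.60, 0.45]                      # accuracy of each of the 4 classes
average_accuracy = sum(per_class_acc) / len(per_class_acc)    # 0.70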

Here is the main code:

def custom_average_accuracy(y_true, y_pred):
    # Mask to remove the labels (y_true) that are zero: ex. [0, 0, 0]
    remove_zeros_mask = tf.math.logical_not(tf.math.reduce_all(tf.math.logical_not(tf.cast(y_true, bool)), axis=-1))
    y_true = tf.boolean_mask(y_true, remove_zeros_mask)
    y_pred = tf.boolean_mask(y_pred, remove_zeros_mask)
    num_cls = y_true.shape[-1]
    y_pred = tf.math.argmax(y_pred, axis=-1)        # ex. [0, 0, 1] -> [2]
    y_true = tf.math.argmax(y_true, axis=-1)
    accuracies = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
    for i in range(0, num_cls):
        cls_mask = y_true == i
        cls_y_true = tf.boolean_mask(y_true, cls_mask)
        if not tf.equal(tf.size(cls_y_true), 0):   # Some images don't have all the classes present.
            new_acc = _accuracy(y_true=cls_y_true, y_pred=tf.boolean_mask(y_pred, cls_mask))
            accuracies = accuracies.write(accuracies.size(), new_acc)
    accuracies = accuracies.stack()
    return tf.math.reduce_sum(accuracies) / tf.cast(len(accuracies), dtype=accuracies.dtype)

I believe the problem might be in the if not tf.equal(tf.size(cls_y_true), 0) line, but I still can't see where.
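
One experiment I could run to isolate that line (just a sketch; run_eagerly is a standard compile flag, but the loss shown here is a placeholder for whatever I actually use) is to force the metric to run as plain Python instead of being traced into a graph, and see whether the fit/evaluate gap changes:

model.compile(optimizer='adam',
              loss=my_loss,                        # placeholder for my real loss
              metrics=['accuracy', custom_average_accuracy],
              run_eagerly=True)                    # metric runs eagerly, the Python `if` is evaluated per batch
model.fit(train_ds, epochs=1)
model.evaluate(train_ds)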


More weird information. These are exactly my lines of code:

x_input, y_true = np.concatenate([x for x, y in ds], axis=0), np.concatenate([y for x, y in ds], axis=0)
model.evaluate(x=x_input, y=y_true)   # This gets 38% accuracy
model.evaluate(ds)                    # This gets 55% accuracy

What the hell is going on here? How can those lines of code give different results?!


Update: if I don't do ds = ds.shuffle(), the example above (the 30-ish vs. 50-ish accuracy values) is OK.
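
A plausible explanation (an assumption on my side, not verified on my pipeline): Dataset.shuffle() reshuffles on every iteration by default, and the two list comprehensions above iterate the dataset twice, so the images from the first pass get paired with labels from a differently shuffled second pass. A minimal sketch of that behavior:

import tensorflow as tf

ds = tf.data.Dataset.range(5).map(lambda i: (i, i)).shuffle(5)
xs = [int(x) for x, _ in ds]    # first pass, one shuffle order
ys = [int(y) for _, y in ds]    # second pass, a different order
print(xs, ys)                   # e.g. [3, 0, 4, 1, 2] vs [2, 4, 1, 0, 3]
# shuffle(5, reshuffle_each_iteration=False) would keep the two passes aligned.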

J Agustin Barrachina
  • Yes, the fact that training is done at each batch and therefore the training ACC is different is what I meant when I wrote "performance changes during training", although I agree it is not so clear. But again, that makes performance go from 96% to 36%? Makes no sense to me. – J Agustin Barrachina Nov 26 '21 at 08:47

2 Answers


I tried to reproduce this behavior but could not find the discrepancies you noted. The only thing I changed was the not tf.equal check to tf.math.not_equal:

import pathlib
import tensorflow as tf

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

num_classes = 5
batch_size = 32
img_height = 180
img_width = 180

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

def to_categorical(images, labels):
  return images, tf.one_hot(labels, num_classes)

train_ds = train_ds.map(to_categorical)
val_ds = val_ds.map(to_categorical)

model = tf.keras.Sequential([
  tf.keras.layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
  tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Dropout(0.3),
  tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Dropout(0.3),
  tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Dropout(0.3),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_classes)
])

def _accuracy(y_true, y_pred):
    # Fraction of positions (along the last axis) where the prediction matches the label.
    y_true.shape.assert_is_compatible_with(y_pred.shape)
    if y_true.dtype != y_pred.dtype:
        y_pred = tf.cast(y_pred, y_true.dtype)
    reduced_sum = tf.reduce_sum(tf.cast(tf.math.equal(y_true, y_pred), tf.keras.backend.floatx()), axis=-1)
    return tf.math.divide_no_nan(reduced_sum, tf.cast(tf.shape(y_pred)[-1], reduced_sum.dtype))

def custom_average_accuracy(y_true, y_pred):
    # Mask to remove the labels (y_true) that are zero: ex. [0, 0, 0]
    remove_zeros_mask = tf.math.logical_not(tf.math.reduce_all(tf.math.logical_not(tf.cast(y_true, bool)), axis=-1))
    y_true = tf.boolean_mask(y_true, remove_zeros_mask)
    y_pred = tf.boolean_mask(y_pred, remove_zeros_mask)
    num_cls = y_true.shape[-1]
    y_pred = tf.math.argmax(y_pred, axis=-1)        # ex. [0, 0, 1] -> [2]
    y_true = tf.math.argmax(y_true, axis=-1)
    accuracies = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
    for i in range(0, num_cls):
        cls_mask = y_true == i
        cls_y_true = tf.boolean_mask(y_true, cls_mask)
        if tf.math.not_equal(tf.size(cls_y_true), 0):   # Some images don't have all the classes present.
            new_acc = _accuracy(y_true=cls_y_true, y_pred=tf.boolean_mask(y_pred, cls_mask))
            accuracies = accuracies.write(accuracies.size(), new_acc)
    accuracies = accuracies.stack()
    return tf.math.reduce_sum(accuracies) / tf.cast(len(accuracies), dtype=accuracies.dtype)

model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy', custom_average_accuracy])

epochs=10
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs)

model.evaluate(train_ds)
model.evaluate(val_ds)
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Epoch 1/10
92/92 [==============================] - 11s 95ms/step - loss: 1.6220 - accuracy: 0.2868 - custom_average_accuracy: 0.2824 - val_loss: 1.2868 - val_accuracy: 0.4946 - val_custom_average_accuracy: 0.4597
Epoch 2/10
92/92 [==============================] - 8s 85ms/step - loss: 1.2131 - accuracy: 0.4785 - custom_average_accuracy: 0.4495 - val_loss: 1.2051 - val_accuracy: 0.4673 - val_custom_average_accuracy: 0.4350
Epoch 3/10
92/92 [==============================] - 8s 84ms/step - loss: 1.0713 - accuracy: 0.5620 - custom_average_accuracy: 0.5404 - val_loss: 1.1070 - val_accuracy: 0.5232 - val_custom_average_accuracy: 0.5003
Epoch 4/10
92/92 [==============================] - 8s 83ms/step - loss: 0.9463 - accuracy: 0.6281 - custom_average_accuracy: 0.6203 - val_loss: 0.9880 - val_accuracy: 0.5967 - val_custom_average_accuracy: 0.5755
Epoch 5/10
92/92 [==============================] - 8s 84ms/step - loss: 0.8400 - accuracy: 0.6771 - custom_average_accuracy: 0.6730 - val_loss: 0.9420 - val_accuracy: 0.6308 - val_custom_average_accuracy: 0.6245
Epoch 6/10
92/92 [==============================] - 8s 83ms/step - loss: 0.7594 - accuracy: 0.7027 - custom_average_accuracy: 0.7004 - val_loss: 0.8972 - val_accuracy: 0.6431 - val_custom_average_accuracy: 0.6328
Epoch 7/10
92/92 [==============================] - 8s 82ms/step - loss: 0.6211 - accuracy: 0.7619 - custom_average_accuracy: 0.7563 - val_loss: 0.8999 - val_accuracy: 0.6431 - val_custom_average_accuracy: 0.6174
Epoch 8/10
92/92 [==============================] - 8s 82ms/step - loss: 0.5108 - accuracy: 0.8116 - custom_average_accuracy: 0.8046 - val_loss: 0.8809 - val_accuracy: 0.6689 - val_custom_average_accuracy: 0.6457
Epoch 9/10
92/92 [==============================] - 8s 83ms/step - loss: 0.3985 - accuracy: 0.8535 - custom_average_accuracy: 0.8534 - val_loss: 0.9364 - val_accuracy: 0.6676 - val_custom_average_accuracy: 0.6539
Epoch 10/10
92/92 [==============================] - 8s 83ms/step - loss: 0.3023 - accuracy: 0.8995 - custom_average_accuracy: 0.9010 - val_loss: 1.0118 - val_accuracy: 0.6662 - val_custom_average_accuracy: 0.6405
92/92 [==============================] - 6s 62ms/step - loss: 0.2038 - accuracy: 0.9363 - custom_average_accuracy: 0.9357
23/23 [==============================] - 2s 50ms/step - loss: 1.0118 - accuracy: 0.6662 - custom_average_accuracy: 0.663
AloneTogether
  • Interesting, this is a good clue, meaning the problem is most likely in the dataset I am using. Thanks! I will keep you posted once I find the problem. – J Agustin Barrachina Nov 26 '21 at 11:45
  • BTW, the flowers dataset is classification and my case is segmentation. The issue might be there. I will try other segmentation datasets to see what happens. – J Agustin Barrachina Nov 26 '21 at 11:46

Well, I was using a TensorFlow dataset. I changed to NumPy arrays and now everything seems logical and works.

Still, I need to know the reason the tf.data.Dataset didn't work, but at least I no longer have these weird results.

Not tested yet (I would need to get the code back to what it was, which I'll probably do someday), but this might be related.
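
If the reshuffling between iterations really is the culprit (again, an assumption, not something I have confirmed), a sketch of how I could keep tf.data would be to disable it, or to shuffle only the training pipeline; buffer_size here stands in for whatever value I actually use:

train_ds = train_ds.shuffle(buffer_size, reshuffle_each_iteration=False)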

J Agustin Barrachina