I'm running a TensorFlow U-Net model without dropout (but with BatchNorm) and with a custom metric called "average accuracy". This is literally the relevant section of code. As you can see, the datasets must be the same, since I do nothing between fit and evaluate.
model.fit(x=train_ds, epochs=epochs, validation_data=val_ds, shuffle=True,
callbacks=callbacks)
model.evaluate(train_ds)
model.evaluate(val_ds)
train_ds and val_ds are tf.data.Dataset objects. And here is the output.
...
Epoch 10/10
148/148 [==============================] - 103s 698ms/step - loss: 0.1765 - accuracy: 0.5872 - average_accuracy: 0.9620 - val_loss: 0.5845 - val_accuracy: 0.5788 - val_average_accuracy: 0.5432
148/148 [==============================] - 22s 118ms/step - loss: 0.5056 - accuracy: 0.4540 - average_accuracy: 0.3654
29/29 [==============================] - 5s 122ms/step - loss: 0.5845 - accuracy: 0.5788 - average_accuracy: 0.5432
There is an unbelievable difference between the average_accuracy reported during training (fit) and the average_accuracy reported by evaluate (both on the training dataset). I know that BatchNorm can have this effect, and also that performance changes during training, so the two will never be exactly equal. But from 96% to 36%?
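To separate the BatchNorm effect from everything else, here is the kind of check I have in mind (a minimal sketch, assuming train_ds yields (image, label) batches and using the custom metric defined below): compare a training-mode forward pass against an inference-mode one on the same batch.

import tensorflow as tf

# Take a single batch from the training pipeline.
x_batch, y_batch = next(iter(train_ds))

# Forward pass with BatchNorm using the current batch statistics (as during fit)...
pred_train_mode = model(x_batch, training=True)
# ...and with the moving averages (as during evaluate/predict).
pred_infer_mode = model(x_batch, training=False)

# If BatchNorm were the whole story, these two values should already diverge a lot.
print(custom_average_accuracy(y_batch, pred_train_mode).numpy())
print(custom_average_accuracy(y_batch, pred_infer_mode).numpy())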
My custom accuracy is defined below, but I doubt the problem is my implementation, since the two numbers should be at least roughly close no matter what I did (I think).
Any hint is useful. I don't know whether I should review the custom metric, the dataset, or the model; the problem seems to sit outside all of them.
I tried resuming training after stopping, and average_accuracy starts from where it left off, at more than 90%.
Context for the custom metric: I use it for semantic segmentation, so each input image has a corresponding label image of shape WxHx4 (4 being my total number of classes). The metric computes the average accuracy, i.e. the accuracy of each class separately; with 4 classes that is sum(accuracies per class) / 4.
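For reference, this is what I mean, written out in plain NumPy (a minimal sketch, assuming one-hot labels of shape (N, W, H, 4) and ignoring all-zero label vectors):

import numpy as np

def reference_average_accuracy(y_true_onehot, y_pred_probs):
    # Drop pixels whose label vector is all zeros (unlabelled pixels).
    valid = y_true_onehot.sum(axis=-1) > 0
    y_true = y_true_onehot[valid].argmax(axis=-1)
    y_pred = y_pred_probs[valid].argmax(axis=-1)
    accs = []
    for c in range(y_true_onehot.shape[-1]):
        cls = y_true == c
        if cls.any():  # skip classes that are absent
            accs.append((y_pred[cls] == c).mean())
    return float(np.mean(accs))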
Here is the actual TensorFlow code:
def custom_average_accuracy(y_true, y_pred):
    # Mask to remove the labels (y_true) that are zero: ex. [0, 0, 0]
    remove_zeros_mask = tf.math.logical_not(tf.math.reduce_all(tf.math.logical_not(tf.cast(y_true, bool)), axis=-1))
    y_true = tf.boolean_mask(y_true, remove_zeros_mask)
    y_pred = tf.boolean_mask(y_pred, remove_zeros_mask)
    num_cls = y_true.shape[-1]
    y_pred = tf.math.argmax(y_pred, axis=-1)  # ex. [0, 0, 1] -> [2]
    y_true = tf.math.argmax(y_true, axis=-1)
    accuracies = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
    for i in range(0, num_cls):
        cls_mask = y_true == i
        cls_y_true = tf.boolean_mask(y_true, cls_mask)
        if not tf.equal(tf.size(cls_y_true), 0):  # Some images don't have all the classes present.
            new_acc = _accuracy(y_true=cls_y_true, y_pred=tf.boolean_mask(y_pred, cls_mask))
            accuracies = accuracies.write(accuracies.size(), new_acc)
    accuracies = accuracies.stack()
    return tf.math.reduce_sum(accuracies) / tf.cast(len(accuracies), dtype=accuracies.dtype)
I believe the problem might be in the if not tf.equal(tf.size(cls_y_true), 0) line, but I still can't see where.
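In case the issue is how that Python if behaves once the metric gets traced into a graph, a branch-free variant that avoids the conditional entirely could look like this (a sketch under the same assumptions, not necessarily the fix):

def average_accuracy_branch_free(y_true, y_pred):
    num_cls = y_true.shape[-1]
    # Keep only pixels whose label vector is not all zeros.
    valid = tf.reduce_any(tf.cast(y_true, tf.bool), axis=-1)
    y_true_idx = tf.boolean_mask(tf.math.argmax(y_true, axis=-1), valid)
    y_pred_idx = tf.boolean_mask(tf.math.argmax(y_pred, axis=-1), valid)
    # Per-class pixel counts and per-class correct counts.
    true_onehot = tf.one_hot(y_true_idx, num_cls, dtype=tf.float32)
    correct = tf.cast(tf.equal(y_true_idx, y_pred_idx), tf.float32)
    per_class_total = tf.reduce_sum(true_onehot, axis=0)
    per_class_correct = tf.reduce_sum(true_onehot * correct[:, tf.newaxis], axis=0)
    # Average over the classes that are actually present in this batch.
    present = per_class_total > 0
    per_class_acc = tf.boolean_mask(per_class_correct, present) / tf.boolean_mask(per_class_total, present)
    return tf.reduce_mean(per_class_acc)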
More weird information. These are exactly my lines of code:
x_input, y_true = np.concatenate([x for x, y in ds], axis=0), np.concatenate([y for x, y in ds], axis=0)
model.evaluate(x=x_input, y=y_true) # This gets 38% accuracy
model.evaluate(ds) # This gets 55% accuracy
What the hell is going on here? How can those lines of code give a different result?!?!
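One thing I want to rule out (an assumption on my part, not something I have verified): evaluate on plain NumPy arrays falls back to Keras' default batch_size=32, while the tf.data pipeline keeps whatever batch size it was built with, and as far as I understand a custom metric function is averaged per batch rather than streamed over the whole dataset, so different batching alone can move the number. Something like this would make the comparison fairer (BATCH_SIZE stands for whatever batch size ds uses):

model.evaluate(x=x_input, y=y_true, batch_size=BATCH_SIZE)
model.evaluate(ds)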
So now I find that if I don't call ds = ds.shuffle(), the example above (the ~38% vs ~55% accuracy values) comes out OK, i.e. the two evaluate calls agree.
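Which makes me suspect (again an assumption, still to be confirmed) that the two list comprehensions above iterate the dataset twice, and ds.shuffle() reshuffles on every iteration by default (reshuffle_each_iteration=True), so x_input and y_true end up in different orders and are never aligned. A quick way to check whether two passes over the shuffled dataset agree:

import numpy as np

# Collect the labels from two separate passes over the shuffled dataset.
labels_pass_1 = np.concatenate([y for _, y in ds], axis=0)
labels_pass_2 = np.concatenate([y for _, y in ds], axis=0)

# With reshuffle_each_iteration=True (the default), these will generally differ,
# which would mean x_input and y_true above were never paired correctly.
print(np.array_equal(labels_pass_1, labels_pass_2))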