
I get very strange behavior when comparing model.evaluate() and model.predict() results. As you can see in the screenshot, the precision and recall returned by model.evaluate() correspond to an F1 of ~0.926, but the F1 computed from the predictions made by model.predict() is much lower. Any ideas how this could happen?

[screenshot of code]

This only happens when evaluating an out-of-sample dataset. For the test data used as validation data during training, model.evaluate() and model.predict() give the same F1.

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
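
For reference, a minimal sketch of the comparison described above; the names X_oos/y_oos for the out-of-sample arrays and the 0.5 cut-off are assumptions, not taken from the screenshot:

from sklearn.metrics import f1_score

# F1 derived from the precision/recall that model.evaluate() reports
# (metric order follows the compile() call above: loss, accuracy, precision, recall)
loss, acc, prec, rec = model.evaluate(X_oos, y_oos, verbose=0)
f1_from_evaluate = 2 * prec * rec / (prec + rec)

# F1 computed manually from the raw predictions
y_prob = model.predict(X_oos)                 # sigmoid probabilities
y_pred = (y_prob.ravel() > 0.5).astype(int)   # same 0.5 threshold Keras uses by default
f1_from_predict = f1_score(y_oos, y_pred)

print(f1_from_evaluate, f1_from_predict)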
  • eval vs predict? They're not the same thing to compare: eval gives the loss value, predict gives the feedforward output – Dee Jun 02 '21 at 08:32
  • This could be of help: https://stackoverflow.com/q/44476706/11220884 – Tinu Jun 02 '21 at 08:39
  • @datdinhquoc sorry, I mean the sklearn F1 calculation based on the results from predict – maxi Jun 02 '21 at 08:39
  • @Tinu I don't understand how "batching" is involved in this case. Does it mean that the precision and recall given are averages over the 14879 data points evaluated instead of the true precision and recall? Does it mean I should completely ignore the results from model.evaluate() and just calculate manually based on model.predict()? – maxi Jun 02 '21 at 08:47

1 Answer


tf.keras.metrics.Precision() & tf.keras.metrics.Recall(): these accumulate true/false positive counts globally across all batches, which corresponds to a 'micro'-style average, at a default threshold of 0.5.

from sklearn.metrics import f1_score: for binary targets this defaults to average='binary', i.e. the F1 of the positive class only; 'macro' (or 'micro') has to be requested explicitly.

If you have an imbalanced classification problem, you need 'macro'.
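
A quick toy example (made-up labels, not the questioner's data) showing how much the averaging choice alone can move the F1 score on an imbalanced binary problem:

from sklearn.metrics import f1_score

y_true = [0] * 90 + [1] * 10                       # 90/10 class imbalance
y_pred = [0] * 85 + [1] * 5 + [0] * 8 + [1] * 2    # mostly predicts the majority class

print(f1_score(y_true, y_pred, average='micro'))   # ~0.87, dominated by the majority class
print(f1_score(y_true, y_pred, average='macro'))   # ~0.58, both classes weighted equally
print(f1_score(y_true, y_pred, average='binary'))  # ~0.24, positive class only (sklearn default)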

You can give a macro-averaged F1 score directly as a metric in model.compile, using tfa.metrics.FBetaScore from the TensorFlow Addons package:

tfa.metrics.FBetaScore(num_classes=2, average='macro', threshold=0.9, name='f1_score', dtype=None)

For example:

import tensorflow as tf
import tensorflow_addons as tfa

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=[
        tf.keras.metrics.Recall(name='Recall'),
        tf.keras.metrics.Precision(name='Precision'),
        tfa.metrics.FBetaScore(num_classes=2, average='macro', threshold=0.9, name='f1_score', dtype=None),
        tf.keras.metrics.AUC(name='prc', curve='PR'),  # precision-recall curve
    ],
)
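
Note that threshold=0.9 here counts only predictions above 0.9 as positive, whereas Keras' Precision/Recall metrics (and a plain y_prob > 0.5 cut-off on model.predict() output) use 0.5; set threshold=0.5 in FBetaScore if you want the numbers to be directly comparable.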