
I am building a simple CNN for binary image classification, and the AUC obtained from model.evaluate() is much higher than the AUC obtained from model.predict() + roc_auc_score().

The whole notebook is here.

Compiling the model, and the output of model.fit():

from tensorflow.keras.optimizers import RMSprop

model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['AUC'])

history = model.fit(
      train_generator,
      steps_per_epoch=8,  
      epochs=5,
      verbose=1)

Epoch 1/5 8/8 [==============================] - 21s 3s/step - loss: 6.7315 - auc: 0.5143

Epoch 2/5 8/8 [==============================] - 15s 2s/step - loss: 0.6626 - auc: 0.6983

Epoch 3/5 8/8 [==============================] - 18s 2s/step - loss: 0.4296 - auc: 0.8777

Epoch 4/5 8/8 [==============================] - 14s 2s/step - loss: 0.2330 - auc: 0.9606

Epoch 5/5 8/8 [==============================] - 18s 2s/step - loss: 0.1985 - auc: 0.9767

Then model.evaluate() gives something similar:

model.evaluate(train_generator)

9/9 [==============================] - 10s 1s/step - loss: 0.3056 - auc: 0.9956

But the AUC calculated directly from the model.predict() output is almost half of that:

from sklearn import metrics

x = model.predict(train_generator)  # predicted probabilities for the positive class
metrics.roc_auc_score(train_generator.labels, x)

0.5006148007590132


I have read several posts on similar issues (like this, this, this, and also an extensive discussion on GitHub), but they describe reasons that are irrelevant to my case:

  • using binary_crossentropy for a multiclass task (not my case)
  • the difference between evaluate and predict due to computing on batches vs. the whole dataset (this should not cause as drastic a decline as in my case)
  • using batch normalization and regularization (not my case, and it also should not cause such a large decline)

Any suggestions are much appreciated. Thanks!


EDIT! Solution: I found the solution here; I just needed to call

train_generator.reset()

before model.predict(), and also set shuffle = False in the flow_from_directory() call. The reason for the difference is that the generator outputs batches starting from a different position each time, so the labels and predictions do not match, because they relate to different objects. So the problem is not with the evaluate or predict methods, but with the generator.
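For reference, here is a minimal sketch of the fix; the directory name, image size, and batch size are placeholders for whatever flow_from_directory() is actually called with:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)
train_generator = datagen.flow_from_directory(
    'train_dir',                # hypothetical directory
    target_size=(150, 150),     # placeholder image size
    batch_size=32,
    class_mode='binary',
    shuffle=False)              # keep batch order fixed so labels align

train_generator.reset()         # start predicting from the first batch
x = model.predict(train_generator)
print(metrics.roc_auc_score(train_generator.labels, x))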


EDIT 2 Using train_generator.reset() is not convenient if the generator is created with flow_from_directory(), because it requires setting shuffle = False there, and that creates batches containing a single class during training, which hurts learning. So I ended up redefining train_generator before running predict; see the sketch below.
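A minimal sketch of that workaround, again with placeholder paths and sizes (eval_generator is just an illustrative name): train on a shuffled generator, then build a separate unshuffled generator just for prediction.

# shuffled generator for training, so each batch mixes both classes
train_generator = datagen.flow_from_directory(
    'train_dir', target_size=(150, 150), batch_size=32,
    class_mode='binary', shuffle=True)
model.fit(train_generator, steps_per_epoch=8, epochs=5)

# a fresh, unshuffled generator just for prediction, so that
# predict() output lines up with .labels
eval_generator = datagen.flow_from_directory(
    'train_dir', target_size=(150, 150), batch_size=32,
    class_mode='binary', shuffle=False)
x = model.predict(eval_generator)
print(metrics.roc_auc_score(eval_generator.labels, x))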

Olga Makarova
  • when you evaluate or predict a model, you cannot use the training set; it should be images that were not used in training. I had the same problem where evaluate() and predict() gave very disparate numbers, with predict() way lower. – Nguai al Mar 09 '22 at 01:02
  • You must actually respond with your own solution to mark it as solved. That way the question will not remain open. – J Agustin Barrachina Apr 06 '22 at 14:22

1 Answer


The tensorflow.keras AUC metric computes an approximate AUC (area under the curve) via a Riemann sum, which is not the same implementation as scikit-learn's.

If you want to find the AUC with tensorflow.keras, try:

import tensorflow as tf

m = tf.keras.metrics.AUC()
m.update_state(train_generator.labels, x)  # assuming both have shape (N,)
r = m.result().numpy()
print(r)
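Note that tf.keras.metrics.AUC approximates the curve with a fixed number of thresholds (the num_thresholds argument, 200 by default), so a larger value brings the Riemann sum closer to scikit-learn's exact computation, e.g.:

m = tf.keras.metrics.AUC(num_thresholds=1000)  # finer approximation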
Zabir Al Nazi
  • Thanks for your suggestion! Unfortunately, this did not help - the result obtained with your solution is very similar to the model.predict one and still much lower than model.evaluate (see [here](https://github.com/pro100olga/dlaicourse/blob/master/evaluate_predict.ipynb) at the end of the file). I also think that differences between implementations may cause minor discrepancies, but not 0.9 vs 0.5. – Olga Makarova May 20 '20 at 09:12
  • can you add the shapes of train_generator.labels and x? also first few values? – Zabir Al Nazi May 20 '20 at 09:39
  • sure, please see [here](https://github.com/pro100olga/dlaicourse/blob/master/evaluate_predict.ipynb) (again, at the end of the file) – Olga Makarova May 20 '20 at 11:01