Keras: Classification report accuracy is different between model.predict accuracy for multiclass

Question

Colab link is here:

The data is imported the following way:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    main_folder,
    validation_split=0.1,
    subset="training",
    label_mode='categorical',
    seed=123,
    image_size=(dim, dim))

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    main_folder,
    validation_split=0.1,
    subset="validation",
    label_mode='categorical',
    seed=123,
    image_size=(dim, dim))

The model is trained the following way

model = tf.keras.models.Sequential([
    tf.keras.layers.experimental.preprocessing.Rescaling(1. / 255),
    ...
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(optimizer="adam", loss=tf.keras.losses.CategoricalCrossentropy(), metrics=['accuracy'])

I am struggling to get the right predicted categories and the right true categories for the classification report to work:

y_pred = model.predict(val_ds, batch_size=1)
predicted_categories = np.argmax(y_pred, axis=1)

true_categories = tf.concat([y for x, y in val_ds], axis=0).numpy()
true_categories_argmax = np.argmax(true_categories, axis=1)

print(classification_report(true_categories_argmax, predicted_categories))

At the moment, the epoch output contradicts the classification report:

Epoch 22/75
144/144 [==============================] - 7s 48ms/step - loss: 0.0611 - accuracy: 0.9776 - val_loss: 0.0768 - val_accuracy: 0.9765

The validation set on the model returns

model.evaluate(val_ds)

[==============================] - 0s 16ms/step - loss: 0.0696 - accuracy: 0.9784
[0.06963862478733063, 0.9784313440322876]

while the classification report is very different:

              precision    recall  f1-score   support

         0.0       0.42      0.44      0.43       221
         1.0       0.56      0.54      0.55       289

    accuracy                           0.49       510
   macro avg       0.49      0.49      0.49       510
weighted avg       0.50      0.49      0.50       510

Similar questions here, here, here, here, here with no answers to this issue.

Does this answer your question? [Cannot get classification report even after rounding up predictions](https://stackoverflow.com/questions/66341444/cannot-get-classification-report-even-after-rounding-up-predictions) — o-90, Feb 26 '21 at 13:21
If this is a multi-label classification then you need to round your predictions to get predicted labels, also loss should be `binary_crossentropy`. Another thing is if your data is multilabeled then you should not be able to load them with `image_dataset_from_directory`, but I am not 100% sure. — Frightera, Feb 26 '21 at 14:29
@Frightera the output from `image_dataset_from_directory` says `Found 6457 files belonging to 2 classes. Using 5812 files for training. Found 6457 files belonging to 2 classes. Using 645 files for validation.` So that is fine, because inside that folder the two classes are separated. Regarding your point, can you please show an example of `rounding the predictions to get predicted labels`? — Joseph Adam, Feb 26 '21 at 15:40
I think you are confusing **multi-label** with **multi-class** classification. — Frightera, Feb 26 '21 at 16:01
@Frightera my apologies this is a multi-class, it is either a car or a bicycle. Never both. — Joseph Adam, Feb 26 '21 at 16:03
@gobrewers14 you basically voted the question to `close` and said what was wrong. Can you please propose the right answer then? — Joseph Adam, Mar 01 '21 at 09:29
@JosephAdam I am more than willing to help you, but the issue you are having is not reproducible. It is completely dependent on your specific training and validation data. I don't have that data. It is difficult to fix a problem I cannot reproduce on my computer at home. — o-90, Mar 01 '21 at 12:41
@gobrewers14 thanks for that. I have created a colab so you are able to see what I am seeing. Simply execute the whole notebook and you can see the `model.fit` contradicts the classification report https://colab.research.google.com/drive/1pQFYKRio7JiE5UdteBu2QN9DKQZ3eHcJ?usp=sharing — Joseph Adam, Mar 01 '21 at 14:00

Frightera · Accepted Answer · 2021-03-01T15:15:53.710

2

You set label_mode='categorical', so this is a multi-class classification and you need a softmax activation in your last dense layer. Softmax forces the outputs to sum to 1, so you can roughly interpret them as class probabilities. With sigmoid, each output is computed independently with no such constraint, so the values do not form a distribution over the classes.
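To make the difference concrete, here is a minimal NumPy sketch. The logits are made-up values, and `softmax` and `sigmoid` are hand-written helpers for illustration, not the Keras implementations.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid(logits):
    # Applied element-wise, so each output ignores the other classes.
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical raw outputs (logits) for two samples and five classes.
logits = np.array([[1.0, 1.5, 0.5, -1.0, -0.6],
                   [2.0, -3.0, 0.0, 4.0, 1.0]])

softmax_out = softmax(logits)
sigmoid_out = sigmoid(logits)

print(softmax_out.sum(axis=1))   # each row sums to 1 (up to float rounding)
print(sigmoid_out.sum(axis=1))   # row sums are not constrained
print(np.argmax(softmax_out, axis=1))  # class index per sample
```

Only the softmax rows can be read as a distribution over the five classes.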

My model's last layer: Dense(5, activation = 'softmax')

My model's loss: loss=tf.keras.losses.CategoricalCrossentropy(), same as yours. Labels are one-hot-encoded in this case.

Explanation: I used a 5 class classification for demo purposes, but it follows the same logic.

y_pred = model.predict(val_ds)

y_pred[:2]
>>> array([[0.28257513, 0.4343998 , 0.18222839, 0.04164065, 0.05915598],
       [0.36404607, 0.08850227, 0.15335019, 0.21602921, 0.17807229]],
      dtype=float32)

This indicates the probability of each class. For example, the first sample has a 43% probability of belonging to the second class (index 1). You need to use argmax to find the class index.

predicted_categories = np.argmax(y_pred, axis = 1)
predicted_categories[:2]

array([1, 0])

We now have the predicted classes. Next, we need to obtain the true classes.

true_categories = tf.concat([y for x, y in val_ds], axis = 0).numpy() # convert to np array

true_categories[:2]
>>> array([[1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.]], dtype=float32)

If you feed this into the classification report, you will get the following:

ValueError: Classification metrics can't handle a mix of multilabel-indicator and multiclass targets

We also need to do:

true_categories_argmax = np.argmax(true_categories, axis = 1)
true_categories_argmax[:2]
>>> array([0, 4])

Now it is ready for comparison.

print(classification_report(true_categories_argmax, predicted_categories))

That should produce the expected result:

      precision    recall  f1-score   support

   0       0.55      0.43      0.48       129
   1       0.53      0.83      0.64       176
   2       0.48      0.56      0.52       120
   3       0.75      0.72      0.73       152
   4       0.66      0.31      0.42       157

Edit: The samples might get shuffled because tf.keras.preprocessing.image_dataset_from_directory uses shuffle = True by default. For val_ds, try setting shuffle = False, like this:

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    main_folder,
    validation_split=0.1,
    subset="validation",
    shuffle = False,
    label_mode='categorical',
    seed=123,
    image_size=(dim, dim))
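To illustrate why a shuffled dataset can produce this mismatch, here is a toy sketch in plain Python (no TensorFlow). `ReshufflingDataset` and `perfect_model` are hypothetical stand-ins, and the sketch assumes the real dataset yields its samples in a new order each time it is iterated.

```python
import random

class ReshufflingDataset:
    """Toy stand-in for a shuffled dataset: every pass yields a new order."""

    def __init__(self, samples, batch_size, seed):
        self.samples = list(samples)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        order = self.samples[:]
        self.rng.shuffle(order)  # new order on each iteration
        for i in range(0, len(order), self.batch_size):
            batch = order[i:i + self.batch_size]
            yield [x for x, _ in batch], [y for _, y in batch]

def perfect_model(features):
    # Hypothetical model that is always right: the feature equals the label.
    return list(features)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 500 samples, two balanced classes; feature == label, so the "model" is perfect.
samples = [(i % 2, i % 2) for i in range(500)]
ds = ReshufflingDataset(samples, batch_size=32, seed=123)

# Two passes: predictions come from one iteration, labels from another.
preds_two_pass = [p for xs, _ in ds for p in perfect_model(xs)]
labels_two_pass = [y for _, ys in ds for y in ys]

# One pass: predictions and labels come from the same batches.
preds_one_pass, labels_one_pass = [], []
for xs, ys in ds:
    preds_one_pass.extend(perfect_model(xs))
    labels_one_pass.extend(ys)

print("two passes:", accuracy(labels_two_pass, preds_two_pass))
print("one pass:  ", accuracy(labels_one_pass, preds_one_pass))
```

Even with a perfect model, the two-pass accuracy lands near chance for two balanced classes, because labels and predictions no longer refer to the same samples. Collecting both in a single pass keeps them aligned.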

Edit2: Here is what I came up with:

prediction_classes = np.array([])
true_classes =  np.array([])

for x, y in val_ds:
  prediction_classes = np.concatenate([prediction_classes,
                       np.argmax(model.predict(x), axis = -1)])
  true_classes = np.concatenate([true_classes, np.argmax(y.numpy(), axis=-1)])

Classification Report:

print(classification_report(true_classes, prediction_classes))

              precision    recall  f1-score   support

         0.0       0.74      0.81      0.77      1162
         1.0       0.80      0.72      0.75      1179

    accuracy                           0.77      2341
   macro avg       0.77      0.77      0.76      2341
weighted avg       0.77      0.77      0.76      2341

edited Mar 01 '21 at 15:15

answered Feb 26 '21 at 16:31


thanks for the explanation. I appreciate your time. One thing again, the classification report still shows a different error than the actual epoch. Please take a look at the .gif, updated part of the question. Every time I execute the line, the classification report shows me different accuracy values. – Joseph Adam Feb 26 '21 at 16:56
Minor differences might occur because of the batch_size when you execute predict(). As far as I see recall and precision values were nearly the same. Can you try with y_pred = `model.predict(val_ds, batch_size = 1)`? Also I guess your model is not trained, when you trained it properly, accuracy from classification report and `model.evaluate` should match with *minor differences* – Frightera Feb 26 '21 at 17:11
@frightera there is `model.fit`. Can you please explain what you mean by `also I guess your model is not trained`? You cannot run `model.predict` without having `model.fit` ready – PolarBear10 Feb 26 '21 at 17:25
I missed it. Your accuracy is 0.50 in 2-class classification so it means your model is random guessing. That's why I thought it was not trained. I just tried 2 class classification and everything went fine with accuracy and classification report. What's your accuracy values when training? – Frightera Feb 26 '21 at 17:29
@Frightera when I do `model.evaluate(val_ds)` I get `16/16 [==============================] - 0s 16ms/step - loss: 0.0696 - accuracy: 0.9784`. Every time I execute `print(classification_report(true_categories_argmax, predicted_categories))` I get a different answer, such as 0.50, 0.54, 0.65, 0.30. Something is wrong with the classification report code. Can you please look at the `.gif` in the question? – Joseph Adam Mar 01 '21 at 09:28
@Frightera same problem. I think sklearn is not able to handle multi-class from tensorflow. – Joseph Adam Mar 01 '21 at 10:34
Sklearn can handle it. I am out of ideas as I cannot reproduce this issue; every time I get the expected results, even with 2 classes. – Frightera Mar 01 '21 at 11:14
@Frightera here it is in colab code, I hope you are able to reproduce the issue. Simply, re-execute the whole notebook. Please make a copy to your drive and try to execute there so the original copy is always there. https://colab.research.google.com/drive/1pQFYKRio7JiE5UdteBu2QN9DKQZ3eHcJ#scrollTo=PALBXLpzBiq6 – Joseph Adam Mar 01 '21 at 14:04
@Frightera yes, finally looks right! Can you please add small explanation? – Joseph Adam Mar 01 '21 at 15:36
I do not know exactly how Keras shuffled the data even when you set `shuffle = False` for the dataset. But the last approach seems to be the safest way to iterate over a dataset. – Frightera Mar 01 '21 at 16:04
Hi there... I'm having similar issues. In my case, I'm also using image_dataset_from_directory. My dataset contains 5k pics within 5 folders. With a split of 0.2 for val_ds it receives 1050 pics, nevertheless if I use shuffle = False, it just will take one of 5 classes in it, it's like taking the elements in order... And I need a full sample for my classification report... Any suggestion? I'm here because using shuffle = True I'm having heavy differences between model accuracy and classification report accuracy. Thanks in advance – albertovpd Jan 19 '22 at 20:06
If you use `shuffle = True` then the dataset will be reshuffled in each iteration. The last iteration that I've suggested should work even with `shuffle = True`. You need to iterate over the dataset with: `for x, y in ds: ...` but your label mode should be `categorical`. – Frightera Jan 19 '22 at 20:14

score 1 · Answer 2 · answered Jan 25 '22 at 16:05

Hi there, I was having the same issue, and for me it wasn't enough to have a softmax layer and shuffle = False. In fact, with shuffle = False in image_dataset_from_directory I ran into the following problem: train_ds had just 3 of 5 classes and val_ds had 2 of 5 classes (the split was done without creating heterogeneous samples).

If it helps, I was recommended to do the following:

Make just 2 folders (regardless of your number of classes): a train folder and a validation folder. Each must contain a shuffled portion of the whole dataset.
Use sklearn.model_selection.train_test_split with the stratify parameter set to the labels to create the folders with heterogeneous samples.
When calling image_dataset_from_directory, do not split into training and validation; instead, call it separately on each folder (the train folder and the validation folder), now both with shuffle = False.
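The stratified split in the second step could look roughly like the sketch below. The file paths and class counts are invented for illustration, and copying the files into the two folders is left out.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical file list: 5 classes with 100 images each.
paths = [f"class_{c}/img_{i}.jpg" for c in range(5) for i in range(100)]
labels = [c for c in range(5) for _ in range(100)]

train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels,
    test_size=0.2,
    stratify=labels,    # keep the class proportions equal in both parts
    random_state=123,
)

print(Counter(train_labels))
print(Counter(val_labels))
# The files in train_paths and val_paths would then be copied into the
# train and validation folders, keeping one sub-folder per class.
```

With `stratify=labels`, every class appears in both parts in the same proportion, which avoids a validation set that contains only some of the classes.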

score 0 · Answer 3 · answered Apr 04 '23 at 20:12

A problem I faced with the accepted answer here is that my dataset was batched. One workaround was to set batch_size to 1 in:

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    base_dir,
    labels='inferred',
    validation_split=0.2,
    subset="validation",
    label_mode='int',
    seed=1337,
    color_mode="rgb",
    image_size=(150,150),
    batch_size=1,
   # shuffle=True
)

Then fit the model as usual and get the classification report as follows:

from sklearn.metrics import classification_report
import numpy as np
y_true = []
y_pred=[]
for images, labels in val_ds.as_numpy_iterator():
    y_pred_probs = conv_model.predict(images, batch_size=1, verbose=0)
    y_pred_classes = np.argmax(y_pred_probs, axis=1)
    y_true.extend(labels)
    y_pred.extend(y_pred_classes)

# Convert the true labels to class labels
#y_true_classes = np.argmax(y_true, axis=0)

# Generate a classification report
report = classification_report(y_true, y_pred)

print(report)
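As far as I can tell, a loop like this does not depend on batch_size=1, since each batch contributes all of its labels and predictions. Here is a sketch of a helper that handles any batch size and both integer and one-hot labels. `collect_classes` and `fake_predict` are hypothetical names, and the helper assumes `predict_fn` returns an array of shape (batch, classes).

```python
import numpy as np

def collect_classes(batches, predict_fn):
    """Gather true and predicted class indices from one pass over `batches`."""
    y_true, y_pred = [], []
    for images, labels in batches:
        probs = np.asarray(predict_fn(images))
        labels = np.asarray(labels)
        # One-hot labels (label_mode='categorical') are 2-D; integer labels are 1-D.
        if labels.ndim == 2:
            labels = labels.argmax(axis=1)
        y_true.append(labels)
        y_pred.append(probs.argmax(axis=1))
    return np.concatenate(y_true), np.concatenate(y_pred)

def fake_predict(images):
    # Toy predictor: pretend the first column of each "image" holds the class index.
    classes = images[:, 0].astype(int)
    return np.eye(3)[classes]

# Uneven batch sizes; the label formats are mixed only to exercise both branches.
batches = [
    (np.array([[0.0], [2.0], [1.0]]), np.array([0, 2, 1])),        # integer labels
    (np.array([[1.0], [1.0]]), np.array([[0, 1, 0], [0, 0, 1]])),  # one-hot labels
]
y_true, y_pred = collect_classes(batches, fake_predict)
print(y_true, y_pred)
```

With a real dataset, the call might look like `collect_classes(val_ds.as_numpy_iterator(), lambda x: conv_model.predict(x, verbose=0))`, and the two arrays can be passed straight to `classification_report`.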

Hope this helps.
