Validation accuracy metrics reported by Keras model.fit log and Sklearn.metrics.confusion_matrix don't match each other

Question

The problem is that the reported validation accuracy value I get from Keras model.fit history is significantly higher than the validation accuracy metric I get from sklearn.metrics functions.

The results I get from model.fit are summarized below:

Last Validation Accuracy: 0.81
Best Validation Accuracy: 0.84

The results (normalized) from sklearn are pretty different:

True Negatives: 0.78
True Positives: 0.77

Validation Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.775 

(see confusion matrix below for reference)

Edit: this calculation is incorrect, because one can not 
use the normalized values to calculate the accuracy, since 
it does not account for differences in the total absolute 
number of points in the dataset. Thanks to the comment by desertnaut

Here is the graph of the validation accuracy data from model.fit history:
And here is the Confusion matrix generated from sklearn:

I think this question is somewhat similar as this one Sklearn metrics values are very different from Keras values But I've checked both methods are doing the validation on the same pool of data, so that answer is probably not adequate for my case.

Also, this question Keras binary accuracy metric gives too high accuracy seems to address some problems with the way that binary cross entropy affects a multiclass problem, but in my case it may not apply, since it is a true binary classification problem.

Here are the commands used:

Model definition:

inputs = Input((Tx, ))
n_e = 30
embeddings = Embedding(n_x, n_e, input_length=Tx)(inputs)
out = Bidirectional(LSTM(32, recurrent_dropout=0.5, return_sequences=True))(embeddings)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5, return_sequences=True))(out)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5))(out)
out = Dense(3, activation='softmax')(out)
modelo = Model(inputs=inputs, outputs=out)
modelo.summary()

Model Summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 30)           86610     
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 64)           16128     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 32)           10368     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                6272      
_________________________________________________________________
dense (Dense)                (None, 3)                 99        
=================================================================
Total params: 119,477
Trainable params: 119,477
Non-trainable params: 0
_________________________________________________________________

Model compilation:

mymodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

Model fit call:

num_epochs = 30
myhistory = mymodel.fit(X_pad, y, epochs=num_epochs, batch_size=50, validation_data=[X_val_pad, y_val_oh], shuffle=True, callbacks=callbacks_list)

Model fit log:

Train on 505 samples, validate on 127 samples

Epoch 1/30
500/505 [============================>.] - ETA: 0s - loss: 0.6135 - acc: 0.6667
[...]
Epoch 10/30
500/505 [============================>.] - ETA: 0s - loss: 0.1403 - acc: 0.9633
Epoch 00010: val_acc improved from 0.77953 to 0.79528, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 41ms/sample - loss: 0.1393 - acc: 0.9637 - val_loss: 0.5203 - val_acc: 0.7953
Epoch 11/30
500/505 [============================>.] - ETA: 0s - loss: 0.0865 - acc: 0.9840
Epoch 00011: val_acc did not improve from 0.79528
505/505 [==============================] - 21s 41ms/sample - loss: 0.0860 - acc: 0.9842 - val_loss: 0.5257 - val_acc: 0.7953
Epoch 12/30
500/505 [============================>.] - ETA: 0s - loss: 0.0618 - acc: 0.9900
Epoch 00012: val_acc improved from 0.79528 to 0.81102, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0615 - acc: 0.9901 - val_loss: 0.5472 - val_acc: 0.8110
Epoch 13/30
500/505 [============================>.] - ETA: 0s - loss: 0.0415 - acc: 0.9940
Epoch 00013: val_acc improved from 0.81102 to 0.82152, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0413 - acc: 0.9941 - val_loss: 0.5853 - val_acc: 0.8215
Epoch 14/30
500/505 [============================>.] - ETA: 0s - loss: 0.0443 - acc: 0.9933
Epoch 00014: val_acc did not improve from 0.82152
505/505 [==============================] - 21s 42ms/sample - loss: 0.0453 - acc: 0.9921 - val_loss: 0.6043 - val_acc: 0.8136
Epoch 15/30
500/505 [============================>.] - ETA: 0s - loss: 0.0360 - acc: 0.9933
Epoch 00015: val_acc improved from 0.82152 to 0.84777, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0359 - acc: 0.9934 - val_loss: 0.5663 - val_acc: 0.8478
[...]
Epoch 30/30
500/505 [============================>.] - ETA: 0s - loss: 0.0039 - acc: 1.0000
Epoch 00030: val_acc did not improve from 0.84777
505/505 [==============================] - 20s 41ms/sample - loss: 0.0039 - acc: 1.0000 - val_loss: 0.8340 - val_acc: 0.8110

Confusion matrix from sklearn:

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_values, predicted_values)

The prediction values and gold values are determined as follows:

preds = mymodel.predict(X_val)
preds_ints = [[el] for el in np.argmax(preds, axis=1)]
values_pred = tokenizer_y.sequences_to_texts(preds_ints)
values_gold = tokenizer_y.sequences_to_texts(y_val)

Finally, I'd like to add that I have printed out the data and all prediction errors and I believe the sklearn values are more reliable, since they seem to match the results I get from printing out the predictions for the saved "best" model.

On the other hand, I can't understand how the metrics can be so different. Since they are both very well know softwares, I conclude I'm the one making the mistake here, but I can't pin down where or how.

what is your metric: `acc` in your keras part? and how do you calculate predicted_values? — PV8, Sep 03 '19 at 08:52
Since we don't have acces to amount of data in each class we can not really compare keras accuracy with sklearn confusion matrix... I can't find documentation about it but from memory, keras accuracy is the mean accuracy between every batch. e.g => for 1 epoch you have 10 bach. At first batch you have 80% accuracy, model adjust weight so you have 81% accuracy at 2nd batch ect... output accuracy will be smaller than accuracy computed on all data at the end of the epoch — akhetos, Sep 03 '19 at 10:53
You don't actually show the accuracy for scikit-learn (TPs and TNs are **not** accuracy); plus, it is highly doubtful that you can actually discriminate between accuracies of 0.84 and 0.78 with such "hand" experiments. Please show the actual (i.e. *not* normalized) output of your `confusion_matrix` command; also, use the scikit-learn [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) method - update your post with the results — desertnaut, Sep 03 '19 at 10:54
@desertnaut Accuracy is (TP + TN)/(TP + TN + FP + FN), correct? So, I posted the normalized values, so it means Accuracy from sklearn = (0.78 + 0.77) / (2) = 0.775 — Jairo Alves, Sep 03 '19 at 11:30
@akhetos The amount of data is about equal for both classes, on the training dataset and also on the validation dataset. `FALSE 316` `TRUE 316` I'll try to understand better the way keras report validation accuracy. From your post, I see there is a possibility that it is not doing the same thing as taking the entire (held-out) validation data to generate the epoch-wise reported validation accuracy. — Jairo Alves, Sep 03 '19 at 11:34
Your computation of the accuracy is **wrong** (where exactly does this (TP+TN)/2 come from??); please see answer below. If discrepancies with Keras still exist, please do **not** alter the question - this would made the answer invalid, which indeed addressed a problem in your method. Instead, open a new question. — desertnaut, Sep 03 '19 at 12:50
Despite the clarification in your (now deleted) comment in answer below, your accuracy computation is **wrong**: it is easy to see that the denominator in your computation will *always* be 2 in any normalized confusion matrix (since each row by definition sums up to 1). Using such means to compute the accuracy when methods such as `accuracy_score` are available just adds unnecessary complexity in an already complicated issue (involving two different frameworks in a question that cannot be reproduced by canditate respondents), and is completely unjustifiable... — desertnaut, Sep 03 '19 at 22:06

desertnaut · Accepted Answer · 2019-09-03T13:16:47.747

Your question is ill-posed; as already commented, you have not computed the actual accuracy of your scikit-learn model, hence you seem to compare apples with oranges. The computation (TP + TN)/2 from a normalized confusion matrix does not give the accuracy. Here is a simple deomonstration using toy data, adapting the plot_confusion_matrix from the docs:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# toy data
y_true = [0, 1, 0, 1, 0, 0, 0, 1]
y_pred =  [1, 1, 1, 0, 1, 1, 0, 1]
class_names=[0,1]

# plot_confusion_matrix function

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

Computing the normalized confusion matrix gives:

plot_confusion_matrix(y_true, y_pred, classes=class_names, normalize=True)
# result:
Normalized confusion matrix
[[ 0.2         0.8       ]
 [ 0.33333333  0.66666667]]

and according to your incorrect rationale, the accuracy should be:

(0.67 + 0.2)/2
# 0.435

(Notice how in the normalized matrix the rows add to 100%, something that does not happen in the full confusion matrix)

But let's now see what the real accuracy is from the un-normalized confusion matrix:

plot_confusion_matrix(y_true, y_pred, classes=class_names) # normalize=False by default
# result
Confusion matrix, without normalization
[[1 4]
 [1 2]]

from which, by the definition of accuracy as (TP + TN) / (TP + TN + FP + FN), we get:

(1+2)/(1+2+4+1)
# 0.375

Of course, we don't need the confusion matrix to get something so elementary as the accuracy; as already advised in the comments, we can simply use the built-in accuracy_score method of scikit-learn:

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
# 0.375

which, rather unsurprisingly, agrees with our direct computation from the confusion matrix.

Bottom line:

where specific methods (like accuracy_score) exist, it is definitely preferable to use them instead of ad hoc inspirations, especially when something does not look right (like a discrepancy between Keras and scikit-learn reported accuracies)
the fact that in this example the actual accuracy was lower than the one computed by your own way obviously does not say anything for the specific problem you report
if the discrepancy with Keras still exists even after computing the correct accuracy for your data, please do not alter the question with the new situation, as this would make the answer invalid, despite the fact that it highlights a mistaken point in your method - open a new question instead

Validation accuracy metrics reported by Keras model.fit log and Sklearn.metrics.confusion_matrix don't match each other

1 Answers1

Linked