I need some help understanding how accuracy is calculated when fitting a model in Keras. This is a sample history of training the model:
Train on 340 samples, validate on 60 samples
Epoch 1/100
340/340 [==============================] - 5s 13ms/step - loss: 0.8081 - acc: 0.7559 - val_loss: 0.1393 - val_acc: 1.0000
Epoch 2/100
340/340 [==============================] - 3s 9ms/step - loss: 0.7815 - acc: 0.7647 - val_loss: 0.1367 - val_acc: 1.0000
Epoch 3/100
340/340 [==============================] - 3s 10ms/step - loss: 0.8042 - acc: 0.7706 - val_loss: 0.1370 - val_acc: 1.0000
...
Epoch 25/100
340/340 [==============================] - 3s 9ms/step - loss: 0.6006 - acc: 0.8029 - val_loss: 0.2418 - val_acc: 0.9333
Epoch 26/100
340/340 [==============================] - 3s 9ms/step - loss: 0.5799 - acc: 0.8235 - val_loss: 0.3004 - val_acc: 0.8833
So validation accuracy is 1.0 in the first epochs? How can the validation accuracy be better than the training accuracy?
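For context, my understanding is that with one-hot labels and categorical_crossentropy, metrics=['accuracy'] resolves to Keras's categorical accuracy, which in numpy terms is roughly this (a sketch, not the actual Keras implementation):

import numpy as np

def categorical_accuracy(y_true, y_pred):
    # Fraction of samples whose predicted class (argmax of the softmax
    # output) matches the true one-hot class.
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))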
These are figures that show all the accuracy and loss values:
Then I use sklearn metrics to evaluate the final results:
from sklearn import metrics

def evaluate(predicted_outcome, expected_outcome):
    # Weighted F1 averages per-class F1 scores weighted by their support.
    f1_score = metrics.f1_score(expected_outcome, predicted_outcome, average='weighted')
    # Balanced accuracy is the average of recall over all classes.
    balanced_accuracy_score = metrics.balanced_accuracy_score(expected_outcome, predicted_outcome)
    print('****************************')
    print('| MODEL PERFORMANCE REPORT |')
    print('****************************')
    print('Average F1 score = {:0.2f}.'.format(f1_score))
    print('Balanced accuracy score = {:0.2f}.'.format(balanced_accuracy_score))
    print('Confusion matrix')
    print(metrics.confusion_matrix(expected_outcome, predicted_outcome))
    print('Other metrics')
    print(metrics.classification_report(expected_outcome, predicted_outcome))
I get this output (as you can see, the results are terrible):
****************************
| MODEL PERFORMANCE REPORT |
****************************
Average F1 score = 0.25.
Balanced accuracy score = 0.32.
Confusion matrix
[[ 7 24 2 40]
[ 11 70 4 269]
[ 0 0 0 48]
[ 0 0 0 6]]
Other metrics
              precision    recall  f1-score   support

           0       0.39      0.10      0.15        73
           1       0.74      0.20      0.31       354
           2       0.00      0.00      0.00        48
           3       0.02      1.00      0.03         6

   micro avg       0.17      0.17      0.17       481
   macro avg       0.29      0.32      0.12       481
weighted avg       0.61      0.17      0.25       481
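Note that the 0.17 micro average is just plain accuracy on the test set: the diagonal of the confusion matrix sums to 7 + 70 + 0 + 6 = 83 correct predictions out of 481 samples, and 83 / 481 ≈ 0.17. So the sklearn numbers are at least internally consistent.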
Why are the accuracy and loss values from the Keras fit function so different from the sklearn metrics values?
This is my model, in case it helps:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense

model = Sequential()
model.add(LSTM(units=100,               # the number of hidden states
               return_sequences=True,   # emit the output for every timestep
               input_shape=(timestamps, nb_features),
               dropout=0.2,
               recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'],
              optimizer='adadelta')
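I fit the model along these lines (a sketch; validation_split=0.15 is what produces the 340/60 split shown in the training log above):

# Sketch of the training call: validation_split=0.15 holds out
# 60 of the 400 training sequences, matching the log above.
history = model.fit(X_train, y_train,
                    epochs=100,
                    validation_split=0.15)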
Input data dimensions:
400 train sequences
481 test sequences
X_train shape: (400, 20, 17)
X_test shape: (481, 20, 17)
y_train shape: (400, 4)
y_test shape: (481, 4)
This is how I apply sklearn metrics:
# Predict class probabilities for the test set, then collapse both the
# softmax outputs and the one-hot ground truth to class indices.
testPredict = model.predict(np.array(X_test))
y_test = np.argmax(y_test.values, axis=1)  # y_test is a pandas DataFrame here
y_pred = np.argmax(testPredict, axis=1)
evaluate(y_pred, y_test)
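As a sanity check (a sketch; it assumes it runs before y_test is overwritten with class indices above), Keras can score the same test set directly for comparison with fit()'s metrics:

# Let Keras compute loss/accuracy on the same test data, using the
# original one-hot y_test.
loss, acc = model.evaluate(np.array(X_test), y_test.values)
print('Keras test loss = {:0.4f}, accuracy = {:0.4f}'.format(loss, acc))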
It looks like I am missing something.