The problem is that the reported validation accuracy I get from the Keras model.fit history is significantly higher than the validation accuracy I get from sklearn.metrics functions.
The results I get from model.fit are summarized below:
Last Validation Accuracy: 0.81
Best Validation Accuracy: 0.84
The results (normalized) from sklearn are pretty different:
True Negatives: 0.78
True Positives: 0.77
Validation Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.775
(see confusion matrix below for reference)
Edit: this calculation is incorrect, because one cannot use the normalized values to compute accuracy; they do not account for the different absolute numbers of samples in each class. Thanks to desertnaut for pointing this out in the comments.
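To see why, here is a small sketch with made-up counts (the class sizes below are hypothetical, not my actual data): accuracy is the support-weighted average of the per-class rates, so simply averaging the normalized values only works when both classes have the same number of samples.
# Hypothetical raw counts, NOT my actual data, chosen so that the
# normalized rates match the matrix above (TN rate 0.78, TP rate ~0.77).
tn, fp = 78, 22     # 100 negative samples -> TN rate = 0.78
tp, fn = 208, 62    # 270 positive samples -> TP rate ~= 0.77

# Correct: accuracy from raw counts (support-weighted average of the rates)
accuracy = (tp + tn) / (tp + tn + fp + fn)        # ~0.773

# Incorrect unless both classes have equal support
naive = (tn / (tn + fp) + tp / (tp + fn)) / 2     # ~0.775

print(accuracy, naive)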
Here is the graph of the validation accuracy data from model.fit history:
And here is the Confusion matrix generated from sklearn:
I think this question is somewhat similar to this one: Sklearn metrics values are very different from Keras values. However, I've checked that both methods run the validation on the same pool of data, so that answer probably does not apply to my case.
Also, this question, Keras binary accuracy metric gives too high accuracy, addresses some problems with the way binary cross-entropy affects a multiclass problem, but it may not apply in my case, since this is a true binary classification problem.
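That said, the model defined below does end in a 3-unit softmax while being compiled with binary_crossentropy, and in that combination Keras resolves the metric string 'acc' to binary_accuracy, which scores each output unit independently and can inflate the reported value. A minimal sketch to rule this out (an alternative compile call for diagnosis, not the one I actually used):
# Diagnostic compile call (not the original one): requesting the metrics
# explicitly stops Keras from inferring binary_accuracy from the loss.
# If val_categorical_accuracy tracks the sklearn number while
# val_binary_accuracy tracks the values reported above, the discrepancy
# is the metric, not the data.
mymodel.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['binary_accuracy', 'categorical_accuracy'])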
Here are the commands used:
Model definition:
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input((Tx,))  # Tx = sequence length (100, per the summary below)
n_e = 30               # embedding dimension
embeddings = Embedding(n_x, n_e, input_length=Tx)(inputs)  # n_x = vocabulary size
out = Bidirectional(LSTM(32, recurrent_dropout=0.5, return_sequences=True))(embeddings)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5, return_sequences=True))(out)
out = Bidirectional(LSTM(16, recurrent_dropout=0.5))(out)
out = Dense(3, activation='softmax')(out)
mymodel = Model(inputs=inputs, outputs=out)
mymodel.summary()
Model Summary:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 100)               0
_________________________________________________________________
embedding (Embedding)        (None, 100, 30)           86610
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 64)           16128
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 32)           10368
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                6272
_________________________________________________________________
dense (Dense)                (None, 3)                 99
=================================================================
Total params: 119,477
Trainable params: 119,477
Non-trainable params: 0
_________________________________________________________________
Model compilation:
mymodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
Model fit call:
num_epochs = 30
myhistory = mymodel.fit(X_pad, y,
                        epochs=num_epochs,
                        batch_size=50,
                        validation_data=(X_val_pad, y_val_oh),
                        shuffle=True,
                        callbacks=callbacks_list)
Model fit log:
Train on 505 samples, validate on 127 samples
Epoch 1/30
500/505 [============================>.] - ETA: 0s - loss: 0.6135 - acc: 0.6667
[...]
Epoch 10/30
500/505 [============================>.] - ETA: 0s - loss: 0.1403 - acc: 0.9633
Epoch 00010: val_acc improved from 0.77953 to 0.79528, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 41ms/sample - loss: 0.1393 - acc: 0.9637 - val_loss: 0.5203 - val_acc: 0.7953
Epoch 11/30
500/505 [============================>.] - ETA: 0s - loss: 0.0865 - acc: 0.9840
Epoch 00011: val_acc did not improve from 0.79528
505/505 [==============================] - 21s 41ms/sample - loss: 0.0860 - acc: 0.9842 - val_loss: 0.5257 - val_acc: 0.7953
Epoch 12/30
500/505 [============================>.] - ETA: 0s - loss: 0.0618 - acc: 0.9900
Epoch 00012: val_acc improved from 0.79528 to 0.81102, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0615 - acc: 0.9901 - val_loss: 0.5472 - val_acc: 0.8110
Epoch 13/30
500/505 [============================>.] - ETA: 0s - loss: 0.0415 - acc: 0.9940
Epoch 00013: val_acc improved from 0.81102 to 0.82152, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0413 - acc: 0.9941 - val_loss: 0.5853 - val_acc: 0.8215
Epoch 14/30
500/505 [============================>.] - ETA: 0s - loss: 0.0443 - acc: 0.9933
Epoch 00014: val_acc did not improve from 0.82152
505/505 [==============================] - 21s 42ms/sample - loss: 0.0453 - acc: 0.9921 - val_loss: 0.6043 - val_acc: 0.8136
Epoch 15/30
500/505 [============================>.] - ETA: 0s - loss: 0.0360 - acc: 0.9933
Epoch 00015: val_acc improved from 0.82152 to 0.84777, saving model to modelo-10-melhor-modelo.hdf5
505/505 [==============================] - 21s 42ms/sample - loss: 0.0359 - acc: 0.9934 - val_loss: 0.5663 - val_acc: 0.8478
[...]
Epoch 30/30
500/505 [============================>.] - ETA: 0s - loss: 0.0039 - acc: 1.0000
Epoch 00030: val_acc did not improve from 0.84777
505/505 [==============================] - 20s 41ms/sample - loss: 0.0039 - acc: 1.0000 - val_loss: 0.8340 - val_acc: 0.8110
Confusion matrix from sklearn:
from sklearn.metrics import confusion_matrix

# sklearn's signature is confusion_matrix(y_true, y_pred)
conf_mat = confusion_matrix(values_gold, values_pred)
The prediction values and gold values are determined as follows:
import numpy as np

preds = mymodel.predict(X_val_pad)  # same padded inputs used for validation in fit
preds_ints = [[el] for el in np.argmax(preds, axis=1)]  # wrap for sequences_to_texts
values_pred = tokenizer_y.sequences_to_texts(preds_ints)
values_gold = tokenizer_y.sequences_to_texts(y_val)
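As an extra cross-check, here is a sketch that computes the accuracy directly on integer labels with sklearn, skipping the sequences_to_texts round-trip (it assumes the saved "best" model is loaded into mymodel and that y_val holds one integer class index per sample):
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Predict on the same padded validation inputs used in model.fit
pred_ints = np.argmax(mymodel.predict(X_val_pad), axis=1)
true_ints = np.asarray(y_val).ravel()  # one integer label per sample (assumed)

print(confusion_matrix(true_ints, pred_ints))  # raw counts, not normalized
print(accuracy_score(true_ints, pred_ints))    # directly comparable to val_acc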
Finally, I'd like to add that I have printed out the data and all the prediction errors, and I believe the sklearn values are the more reliable ones, since they match the results I get from printing out the predictions of the saved "best" model.
On the other hand, I can't understand how the two metrics can be so different. Since both are very well-known libraries, I conclude I'm the one making a mistake here, but I can't pin down where or how.