
I've written an LSTM network with Keras (code below):

    import numpy as np
    import pandas as pd
    import keras_metrics
    from keras import optimizers
    from keras.models import Sequential
    from keras.layers import LSTM, LeakyReLU, Dropout, Flatten, Dense
    from sklearn.utils import shuffle
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("../data/training_data.csv")

    # group by and pivot the data so that each group becomes one fixed-length sequence
    group_index = df.groupby('group').cumcount()
    data = (df.set_index(['group', group_index])
            .unstack(fill_value=0).stack())

    # get np arrays of the data and the labels;
    # for the label we take the first value of each group because it is the same for all rows
    target = np.array(data['label'].groupby(level=0).apply(lambda x: [x.values[0]]).tolist())
    data = data.loc[:, data.columns != 'label']
    data = np.array(data.groupby(level=0).apply(lambda x: x.values.tolist()).tolist())

    # shuffle the training set
    data, target = shuffle(data, target)

    # split the data into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=4)

    # Adam optimizer with learning rate decay
    opt = optimizers.Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0001)

    # build the model
    model = Sequential()

    num_features = data.shape[2]
    num_samples = data.shape[1]  # i.e. the number of timesteps (rows) per sequence

    model.add(LSTM(8, batch_input_shape=(None, num_samples, num_features), return_sequences=True, activation='sigmoid'))
    model.add(LeakyReLU(alpha=.001))
    model.add(Dropout(0.2))
    model.add(LSTM(4, return_sequences=True, activation='sigmoid'))
    model.add(LeakyReLU(alpha=.001))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))

    # f1 is a custom metric function defined elsewhere
    model.compile(loss='binary_crossentropy', optimizer=opt,
                  metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(), f1])

    model.summary()

    # training; keep the history object for plotting the learning curves
    history = model.fit(x_train, y_train, epochs=3000, validation_data=(x_test, y_test))

The monitored metrics are loss, accuracy, precision, recall and f1 score.

I've noticed that the validation loss starts to climb at around 300 epochs, so I figured the model is overfitting. However, recall is still climbing and precision is slightly improving.


[Plots: validation loss, precision, and recall over the training epochs]
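For reference, curves like these can be drawn from the `history` object returned by `model.fit`. A minimal sketch (matplotlib assumed; the exact metric key names such as `'val_precision'` depend on the Keras / keras_metrics versions, so they are an assumption here):

    import matplotlib.pyplot as plt

    # key names are an assumption; inspect history.history.keys() for the exact ones
    for key in ('val_loss', 'val_precision', 'val_recall'):
        plt.plot(history.history[key], label=key)
    plt.xlabel('epoch')
    plt.legend()
    plt.show()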


Why is that? Is my model overfitted?

Shlomi Schwartz
  • The interaction between the loss & "business" metrics (like precision & recall here) is indeed a delicate and rather under-explored one. Not an answer to your exact question, but you may get some useful ideas from my response in [Loss & accuracy - Are these reasonable learning curves?](https://stackoverflow.com/questions/47817424/loss-accuracy-are-these-reasonable-learning-curves/47819022#47819022) – desertnaut Oct 16 '18 at 09:03
  • Thanks for your reply. After reading your post, I'm more convinced that the training process should be stopped when the validation loss starts to rise. I still don't fully understand why the loss rises while precision and recall are improving. – Shlomi Schwartz Oct 16 '18 at 09:14
  • As I said, this is a (very) under-explored topic in the literature. Your point makes sense, but I can easily imagine a counter-argument that "we should continue as long as our *business* metric improves" - since, at the end of the day, it is the *business* metrics we actually care about, isn't it? – desertnaut Oct 16 '18 at 09:23
  • Nicely described, I think I need to validate those two approaches in the real world – Shlomi Schwartz Oct 16 '18 at 10:44
  • Now, I think that's exactly the correct thing to do here... :) – desertnaut Oct 16 '18 at 10:54
  • How many classes do you have in your dataset? – Yahya Oct 23 '18 at 13:13
  • It looks like binary classification from the last layer of the network. Can you please add information about the dataset's skewness (class imbalance)? – Venkatachalam Oct 24 '18 at 02:53

3 Answers


the validation loss starts to climb around 300 epochs (...) recall is still climbing and precision is slightly improving. (...) Why is that?

Precision and recall are measures of how well your classifier performs in terms of the predicted class labels. Model loss, on the other hand, measures the cross-entropy, i.e. the error in the predicted classification probability:

    loss = -(y * log(p) + (1 - y) * log(1 - p))

where

y = true class label (0 or 1)
p = predicted probability of class 1

For example, the (softmax) output of the model for one observation whose true label is 1 might look like this at different epochs:

    # epoch 300
    output = [0.1, 0.9] => argmax(output) => class 1, p = 0.9
    loss = -(1 * log(0.9)) ≈ 0.10

    # epoch 500
    output = [0.4, 0.6] => argmax(output) => class 1, p = 0.6
    loss = -(1 * log(0.6)) ≈ 0.51

In both cases the precision and recall metrics stay unchanged (the class label is still predicted correctly), yet the model loss has increased. In general terms, the model has become "less sure" about its prediction, but it is still correct.

Note that in your model the loss is calculated over all observations, not just a single one; I limit the discussion to one observation for simplicity. The loss formula extends trivially to n > 1 observations by averaging the loss over all of them.
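For a concrete check, here is a small NumPy sketch (mine, not from the post) that evaluates the binary cross-entropy for two predictions that are both correct under a 0.5 threshold but differ in confidence:

    import numpy as np

    def binary_cross_entropy(y_true, p_pred):
        # mean negative log-likelihood of the true labels
        p_pred = np.clip(p_pred, 1e-7, 1 - 1e-7)  # avoid log(0)
        return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

    y_true = np.array([1.0])                               # true label: class 1
    print(binary_cross_entropy(y_true, np.array([0.9])))   # ~0.105 (confident and correct)
    print(binary_cross_entropy(y_true, np.array([0.6])))   # ~0.511 (less confident, still correct)

Both predictions are classified as class 1, so precision and recall are unchanged, while the loss increases roughly fivefold.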

Is my model overfitted?

To determine this, you have to compare the training loss with the validation loss; you cannot tell from the validation loss alone. If the training loss decreases while the validation loss increases, your model is overfitting.
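Assuming the `history` object returned by the `model.fit(...)` call in the question, a quick way to make this comparison is to plot both loss curves (a sketch, matplotlib assumed):

    import matplotlib.pyplot as plt

    plt.plot(history.history['loss'], label='training loss')
    plt.plot(history.history['val_loss'], label='validation loss')
    plt.xlabel('epoch')
    plt.ylabel('binary cross-entropy')
    plt.legend()
    plt.show()

Training loss falling while validation loss rises (here, from roughly epoch 300 on) is the classic overfitting signature.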

miraculixx

Indeed, if the validation loss starts growing again, then you may want to stop early. It's a "standard" approach called "early stopping" (https://en.wikipedia.org/wiki/Early_stopping). Clearly, if the loss on your validation data is increasing, then the model is not doing as well as it could; it is overfitting.
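In Keras this can be automated with the built-in `EarlyStopping` callback. A minimal sketch, assuming the `model` and data from the question and a Keras version recent enough to support `restore_best_weights`:

    from keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss',          # watch the validation loss
                               patience=50,                 # tolerate 50 epochs without improvement
                               restore_best_weights=True)   # roll back to the best epoch

    history = model.fit(x_train, y_train,
                        epochs=3000,
                        validation_data=(x_test, y_test),
                        callbacks=[early_stop])

The `patience` value here is arbitrary; pick it based on how noisy your validation loss is.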

Precision and recall are not enough on their own: they can increase if your model gives more positive predictions and fewer negative ones (for instance 9 positives for 1 negative). These ratios can then appear to improve, while in fact you simply have fewer true negatives.

These two observations put together can help shed some light on what is happening here. The good answers may still be good, but with lower confidence (the loss for individual samples increases on average, while correct answers stay correct), and predictions can shift toward the positive class in a biased way: some false negatives become true positives (boosting recall) while some true negatives are turned into false positives.
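A hypothetical toy example with scikit-learn (numbers invented for illustration) showing how predicting more positives, e.g. by lowering the decision threshold, raises recall while turning a true negative into a false positive:

    import numpy as np
    from sklearn.metrics import precision_score, recall_score, confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
    probs  = np.array([0.9, 0.6, 0.45, 0.40, 0.3, 0.2, 0.1, 0.05])

    for threshold in (0.5, 0.35):
        y_pred = (probs >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        print(threshold,
              'precision =', round(precision_score(y_true, y_pred), 2),
              'recall =', round(recall_score(y_true, y_pred), 2),
              '(tn, fp, fn, tp) =', (tn, fp, fn, tp))

At threshold 0.5 this gives recall 0.67 with no false positives; at 0.35 recall reaches 1.0, but one true negative has become a false positive.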

Matthieu Brucher

As @Matthieu mentioned, it could be biased to look at the precision and recall of one class alone. Maybe we have to look at the performance on the other class as well.

A better measure could be concordance (the AUC of the ROC curve) in the case of binary classification. Concordance measures how well the model rank-orders the data points according to their likelihood of belonging to a class.
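With the model from the question, the AUC can be computed on the held-out set roughly like this (a sketch using scikit-learn; `model`, `x_test` and `y_test` are the objects from the question):

    from sklearn.metrics import roc_auc_score

    # predicted probability of the positive class (single sigmoid output)
    y_prob = model.predict(x_test).ravel()
    print('ROC AUC:', roc_auc_score(y_test.ravel(), y_prob))

An AUC near 0.5 means the ranking is no better than chance; values close to 1.0 indicate good rank-ordering.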

One more option is macro/micro-averaged precision/recall, to get a more complete picture of the model's performance.
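A short sketch of these averaged metrics with scikit-learn, assuming a 0.5 threshold on the sigmoid output of the question's model:

    from sklearn.metrics import precision_score, recall_score

    y_pred = (model.predict(x_test).ravel() >= 0.5).astype(int)
    for avg in ('macro', 'micro'):
        print(avg,
              'precision =', precision_score(y_test.ravel(), y_pred, average=avg),
              'recall =', recall_score(y_test.ravel(), y_pred, average=avg))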

Venkatachalam