classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)

Epoch 1/50
27455/27455 [==============================] - 3s 101us/step - loss: 2.9622 - acc: 0.5374

I know I'm compiling my model in the first line and fitting it in the second, and I know what an optimizer is. I'm interested in the meaning of metrics=['accuracy'] and what the acc: XXX in the log exactly means. Also, I'm getting acc: 1.000 when I train my model (100%), but when I test my model I'm getting 80% accuracy. Is my model overfitting?

Arshad_221b

4 Answers


OK, let's begin from the top.

First, metrics=['accuracy']. The model can be evaluated on multiple metrics, and accuracy is only one of them; others include binary_accuracy, categorical_accuracy, sparse_categorical_accuracy, top_k_categorical_accuracy, and sparse_top_k_categorical_accuracy. These are only the built-in ones; you can even create custom metrics. To understand metrics in more detail, you need a clear understanding of loss in a neural network. You might know that a loss function must be differentiable in order to do backpropagation; this is not necessary for metrics. Metrics are used purely for model evaluation and can therefore even be functions that are not differentiable. As mentioned in the Keras documentation:

A metric function is similar to a loss function, except that the results from evaluating a metric are not used when training the model. You may use any of the loss functions as a metric function.

On your own, you can define a custom accuracy that is not differentiable but captures exactly the objective you need from your model.

TL;DR: metrics are just loss functions that are not used in backpropagation, only for model evaluation.
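
For example, here is a minimal sketch of a custom metric (the name my_accuracy is made up; it just reimplements the built-in categorical accuracy to show the shape of such a function):

import keras.backend as K

def my_accuracy(y_true, y_pred):
    # Same computation as the built-in categorical_accuracy: compare the
    # predicted class (argmax of the probabilities) with the true class.
    # argmax/equal are not differentiable, which is fine for a metric,
    # since it is never backpropagated through.
    hits = K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1))
    return K.mean(K.cast(hits, K.floatx()))

classifier.compile(loss='categorical_crossentropy',
                   optimizer='adam',
                   metrics=['accuracy', my_accuracy])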

Now, about acc: xxx during an epoch: it is the running training accuracy, i.e. the accuracy averaged over the minibatches processed so far in that epoch. Early in an epoch it is based on only a few batches, so it can fluctuate before settling; the value printed when the epoch finishes is the average over the whole epoch.

Finally, a 20% drop in performance between training and test: yes, this can be a case of overfitting, but no one can know for sure without looking at your dataset. Most probably it is overfitting, and you may need to look at the data it performs badly on to find the cause.
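
A minimal sketch of that inspection (X_test and y_test here are assumed to be your test arrays, one-hot encoded to match categorical_crossentropy):

import numpy as np

probs = classifier.predict(X_test)
pred_classes = np.argmax(probs, axis=1)
true_classes = np.argmax(y_test, axis=1)   # assumes one-hot labels

# Indices of the misclassified test samples, to inspect by hand
wrong = np.where(pred_classes != true_classes)[0]
print('misclassified %d of %d test samples' % (len(wrong), len(true_classes)))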

If something is unclear or doesn't make sense, feel free to comment.

anand_v.singh
  • I disagree with this: "most probably it is overfitting, and you may need to look at the data it performs badly on to find the cause". Read my answer. – Vlad Mar 14 '19 at 11:28
  • @Vlad I agree with the part of your answer saying that it doesn't necessarily overfit; that is why I prefaced mine with "no one can know for sure without looking at the dataset". But you go on to say it almost surely doesn't overfit, which I believe is a bit disingenuous, mostly because the OP has not mentioned anything about the dataset or the number of trainable parameters. Also, Keras doesn't always mean deep networks; people have used Keras for just 5-6 layers, and your answer is about extreme deep learning. I can't comment on the papers you have linked, but I will go through them as well; they seem interesting :) – anand_v.singh Mar 14 '19 at 12:33
  • I said that it almost surely doesn't overfit "if your model is equipped with many more effective parameters than the number of training samples". I haven't said that it is true for every case (by the way, most of the beginner tutorials I have seen use a few hundred thousand parameters > 60,000 samples of MNIST). If you could show me a case where it does overfit, say a simple MLP with 1-3 hidden layers for MNIST/CIFAR with a few hundred thousand parameters, I will reconsider my answer. – Vlad Mar 14 '19 at 12:56
  • My answer is not about extreme DL, it is about any DL. I cited an extremely large model to illustrate that such models do not overfit even though this contradicts classical statistical learning theory. – Vlad Mar 14 '19 at 13:00
  • I'm currently researching a new method that prevents overfitting. I'm working with small models which do exhibit a temporal increase in their test error curves. I have never seen models with a large number of parameters overfit in this sense, even after training for tens of thousands of epochs. Therefore I used "almost": it hasn't been proven, but the empirical results suggest that this is the case. – Vlad Mar 14 '19 at 13:08
  • 1
    MNIST Dataset will even work after flattening the image and passing it through an autoencoder with bottleneck layer of even 8 features, with that said I went through the first paper and I am intrigued, I am not completely convinced but certainly Intrigued, if you remember, come back and comment on this when your research is published, I am looking forward to it, but as of now I am not convinced, Best of luck. – anand_v.singh Mar 14 '19 at 14:13
  • Thanks! +1, even though we disagree :) – Vlad Mar 14 '19 at 15:16
  • @Vlad Sorry to bother you again, but are you saying this will happen with all datasets, or only a small group of datasets? – anand_v.singh Mar 15 '19 at 02:19
  • It's impossible to know for sure; I was just saying that I have never encountered an increasing test error curve with large models, and this is what the recent literature indicates as well [2], [5]. – Vlad Mar 15 '19 at 12:25
  • @Vlad Your points and some of the research you presented were so intriguing that I have opened a question about this on the Cross Validated site; could you double-check that I have not misrepresented you in it? https://stats.stackexchange.com/q/397547/236804 – anand_v.singh Mar 15 '19 at 12:53

Having 100% accuracy on the train dataset while having 80% accuracy on the test dataset doesn't mean that your model overfits. Moreover, it almost surely doesn't overfit if your model is equipped with many more effective parameters than the number of training samples [2], [5] (see [1] for an insanely large model example). This contradicts conventional statistical learning theory, but these are the empirical results.

For models with more parameters than training samples, it's better to continue optimizing the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and even if the validation loss increases [3]. This may hold regardless of batch size [4].

Clarifications (edit)

  • The "models" I was referring to are neural networks with two or more hidden layers (could be also convolutional layers prior to dense layers).
  • [1] is cited to show a clear contradiction to classical statistical learning theory, which says that large models may overfit without some form of regularization.
  • I would invite anyone who disagrees with "almost surely doesn't overfit" to provide a reproducible example where models, say for MNIST/CIFAR etc with few hundred thousand parameters do overfit (in a sense of increasing with iterations test error curve).

[1] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017.

[2] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.

[3] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[4] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.

[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Vlad

Starting off with the first part of your question -

Keras defines a Metric as "a function that is used to judge the performance of your model". In this case you are using accuracy as the function to judge how good your model is. (This is the norm)

For the second part of your question: acc is the accuracy of your model at that epoch. This can, and will, change depending on which metrics were defined in the model.

Finally, it is possible that you have ended up with an overfitted model given what you have told us, but there are simple remedies, such as early stopping, dropout, or simply getting more training data.
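
For instance, a rough sketch of early stopping (validation_split and patience are placeholder values to adapt; restore_best_weights needs a reasonably recent Keras, so drop it on older versions):

from keras.callbacks import EarlyStopping

# Stop training once validation loss stops improving, keeping the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)

classifier.fit(X_train, y_train,
               epochs=50, batch_size=100,
               validation_split=0.2,   # hold out 20% of the training data
               callbacks=[early_stop])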

Karan Shishoo
  • 2,402
  • 2
  • 17
  • 32

So the meaning of metrics=['accuracy'] actually depends on which loss function you use. You can see how Keras handles this from line 375 and down: 'accuracy' is mapped to binary_accuracy for binary_crossentropy, to sparse_categorical_accuracy for sparse_categorical_crossentropy, and to categorical_accuracy otherwise. Since you are using categorical_crossentropy, your case falls into the final else branch, hence your metric function is set to

metric_fn = metrics_module.categorical_accuracy

See this post for a description of the logic behind the accuracy calculation; it should clear up the meaning of "accuracy" in your case. It basically just counts how many of your predictions (the class with the maximum probability) match the true class.
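
In plain numpy terms, what categorical accuracy computes looks roughly like this (toy values for illustration):

import numpy as np

y_true = np.array([[0, 0, 1],
                   [0, 1, 0]])             # one-hot true labels
y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.6, 0.3, 0.1]])       # predicted probabilities

# Compare the argmax of the predictions with the argmax of the one-hot labels
acc = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))
print(acc)  # 0.5: the first prediction is right, the second is wrong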

The train vs validation accuracy can show signs of over-fitting. To test this, plot the train accuracy and validation accuracy against each other and see at what point the validation accuracy starts to decrease. Follow this for a good description of how to plot accuracy, loss, etc. to test for over-fitting.
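
A rough sketch of that plot (assumes matplotlib is available; note that Keras versions from the era of your output log the keys as acc/val_acc, while newer ones use accuracy/val_accuracy):

import matplotlib.pyplot as plt

history = classifier.fit(X_train, y_train, epochs=50, batch_size=100,
                         validation_split=0.2)   # hold out data to validate on

plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()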

KrisR89