tf.keras predictions are bad while evaluation is good

Question

I'm programming a model in tf.keras, and running model.evaluate() on the training set usually yields ~96% accuracy. My evaluation on the test set is usually close, about 93%. However, when I predict manually, the model is usually inaccurate. This is my code:

import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd

!git clone https://github.com/DanorRon/data
%cd data
!ls

batch_size = 100
epochs = 15
alpha = 0.001
lambda_ = 0.001
h1 = 50

train = pd.read_csv('/content/data/mnist_train.csv.zip')
test = pd.read_csv('/content/data/mnist_test.csv.zip')

train = train.loc['1':'5000', :]
test = test.loc['1':'2000', :]

train = train.sample(frac=1).reset_index(drop=True)
test = test.sample(frac=1).reset_index(drop=True)

x_train = train.loc[:, '1x1':'28x28']
y_train = train.loc[:, 'label']

x_test = test.loc[:, '1x1':'28x28']
y_test = test.loc[:, 'label']

x_train = x_train.values
y_train = y_train.values

x_test = x_test.values
y_test = y_test.values

nb_classes = 10
targets = y_train.reshape(-1)
y_train_onehot = np.eye(nb_classes)[targets]

nb_classes = 10
targets = y_test.reshape(-1)
y_test_onehot = np.eye(nb_classes)[targets]

model = tf.keras.Sequential()
model.add(layers.Dense(784, input_shape=(784,), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.add(layers.Dense(h1, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(lambda_), kernel_initializer='random_uniform', bias_initializer='zeros'))
model.add(layers.Dense(10, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(lambda_), kernel_initializer='random_uniform', bias_initializer='zeros'))

model.compile(optimizer='SGD',
             loss = 'mse',
             metrics = ['categorical_accuracy'])

model.fit(x_train, y_train_onehot, epochs=epochs, batch_size=batch_size)

model.evaluate(x_test, y_test_onehot, batch_size=batch_size)

prediction = model.predict_classes(x_test)
print(prediction)

print(y_test[1:])

I've heard that a lot of the time when people have this problem, it's just a problem with data input. But I can't see any problem with that here since it almost always predicts wrongly (about as much as you would expect if it was random). How do I fix this problem?

Edit: Here are the specific results:

Last training step:

Epoch 15/15
49999/49999 [==============================] - 3s 70us/sample - loss: 0.0309 - categorical_accuracy: 0.9615

Evaluation output:

2000/2000 [==============================] - 0s 54us/sample - loss: 0.0352 - categorical_accuracy: 0.9310
[0.03524150168523192, 0.931]

Output from model.predict_classes:

[9 9 0 ... 5 0 5]

Output from print(y_test):

[9 0 0 7 6 8 5 1 3 2 4 1 4 5 8 4 9 2 4]

Why `print(y_test[1:])` in the code instead of `print(y_test)`? Can it be that your true labels are just starting from the second one so you are comparing with the wrong predictions? — desertnaut, Apr 01 '19 at 00:09
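The misalignment suspected in this comment can be checked numerically instead of by eye. The following is a minimal sketch on synthetic labels, not the asker's data: the arrays `y_true` and `y_pred` and the simulated 93% accuracy are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for y_test and the model's predicted classes.
y_true = rng.integers(0, 10, size=2000)
y_pred = y_true.copy()

# Simulate a classifier that is right 93% of the time:
# corrupt exactly 140 of the 2000 predictions.
wrong = rng.choice(2000, size=140, replace=False)
y_pred[wrong] = (y_pred[wrong] + 1) % 10

# Element-wise comparison with matching indices.
aligned_acc = np.mean(y_pred == y_true)

# The same comparison when the labels are shifted by one position,
# which is what reading predictions against y_test[1:] amounts to.
shifted_acc = np.mean(y_pred[:-1] == y_true[1:])

print(aligned_acc, shifted_acc)
```

With the labels shifted by one position, agreement should fall to roughly chance level (about 10% for ten classes), which would match the "about as much as you would expect if it was random" impression in the question.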

desertnaut · Accepted Answer · 2019-04-01T14:27:59.120

First thing is, your loss function is wrong: you are in a multi-class classification setting, and you are using a loss function suitable for regression and not classification (MSE).

Change your model compilation to:

model.compile(loss='categorical_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

See the Keras MNIST MLP example for corroboration, and my own answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)? for more details (although here you actually have the inverse problem, i.e. regression loss in a classification setting).
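To make the difference between the two losses concrete, here is a small NumPy sketch that evaluates both on a single one-hot target. The helper functions and the example probability vectors are illustrative assumptions; they mirror the usual textbook definitions rather than Keras internals.

```python
import numpy as np

def mse(y_true, y_prob):
    # Mean squared error averaged over the class dimension.
    return float(np.mean((y_true - y_prob) ** 2))

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Negative log-probability assigned to the true class.
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_prob)))

y_true = np.eye(10)[3]  # one-hot target for class 3

# Softmax-like outputs that sum to 1: one confident and right, one confident and wrong.
confident_right = np.full(10, 0.01)
confident_right[3] = 0.91
confident_wrong = np.full(10, 0.01)
confident_wrong[7] = 0.91

print(round(mse(y_true, confident_right), 4))                       # approx. 0.0009
print(round(mse(y_true, confident_wrong), 4))                       # approx. 0.1809
print(round(categorical_crossentropy(y_true, confident_right), 3))  # approx. 0.094
print(round(categorical_crossentropy(y_true, confident_wrong), 3))  # approx. 4.605
```

For ten classes and outputs that sum to 1, this MSE can never exceed 0.2, whereas the cross-entropy grows without bound as the probability of the true class approaches zero, so confident mistakes are penalized far more heavily.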

Moreover, it is not clear whether the MNIST variant you are using is already normalized; if it is not, you should normalize it yourself when converting the data frames to arrays:

x_train = x_train.values/255
x_test = x_test.values/255
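If you are unsure whether the data has already been scaled, a quick check of the value range settles it. The helper below is a hypothetical sketch (the name `scale_pixels` and the "maximum greater than 1" heuristic are assumptions, not part of any library), and it presumes 8-bit pixel intensities in the range 0 to 255.

```python
import numpy as np

def scale_pixels(x):
    """Return x as float32 in [0, 1]; divide by 255 only if it looks unscaled."""
    x = np.asarray(x, dtype="float32")
    # Heuristic assumption: raw 8-bit pixels have values above 1.
    if x.max() > 1.0:
        x = x / 255.0
    return x

# Tiny stand-in for a row of raw pixel values.
raw = np.array([[0, 128, 255]], dtype="uint8")
scaled = scale_pixels(raw)

print(scaled.min(), scaled.max())
```

Applying the helper a second time leaves already-scaled data unchanged, so it is safe to call on either variant of the dataset.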

It is also not clear why you ask for a 784-unit layer, since this is actually the second layer of your NN (the first is implicitly set by the input_shape argument - see Keras Sequential model input layer), and it certainly does not need to contain one unit for each one of your 784 input features.

UPDATE (after comments):

But why is MSE meaningless for classification?

This is a theoretical issue, not exactly appropriate for SO; roughly speaking, it is for the same reason we don't use linear regression for classification - we use logistic regression, the actual difference between the two approaches being exactly the loss function. Andrew Ng, in his popular Machine Learning course on Coursera, explains this nicely - see his Lecture 6.1 - Logistic Regression | Classification on YouTube (explanation starts at ~ 3:00), as well as section 4.2 Why Not Linear Regression [for classification]? of the (highly recommended and freely available) textbook An Introduction to Statistical Learning by Hastie, Tibshirani and coworkers.
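One commonly cited practical illustration (not necessarily the argument made in the references above) is how the gradient behaves for a single sigmoid output unit. The sketch below assumes a scalar logit `z` and a binary target `y`; the function names are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz of (sigmoid(z) - y)**2
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce(z, y):
    # d/dz of -(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(z)
    p = sigmoid(z)
    return p - y

# A confidently wrong prediction: the true label is 1, the logit is very negative.
z, y = -8.0, 1.0
print(grad_mse(z, y), grad_ce(z, y))
```

In this saturated case the MSE gradient is close to zero, so the unit barely learns from its mistake, while the cross-entropy gradient stays close to -1 and keeps pushing the logit in the right direction.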

And MSE does give a high accuracy, so why doesn't that matter?

Nowadays, almost anything you throw at MNIST will "work", which of course neither makes it correct nor a good approach for more demanding datasets...

UPDATE 2:

whenever I run with crossentropy, the accuracy just flutters around at ~10%

Sorry, cannot reproduce the behavior... Taking the Keras MNIST MLP example with a simplified version of your model, i.e.:

model = Sequential()
model.add(Dense(784, activation='linear', input_shape=(784,)))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(),
              metrics=['accuracy'])

we easily end up with a ~ 92% validation accuracy after only 5 epochs:

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=5,
                    verbose=1,
                    validation_data=(x_test, y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 4s - loss: 0.8974 - acc: 0.7801 - val_loss: 0.4650 - val_acc: 0.8823
Epoch 2/10
60000/60000 [==============================] - 4s - loss: 0.4236 - acc: 0.8868 - val_loss: 0.3582 - val_acc: 0.9034
Epoch 3/10
60000/60000 [==============================] - 4s - loss: 0.3572 - acc: 0.9009 - val_loss: 0.3228 - val_acc: 0.9099
Epoch 4/10
60000/60000 [==============================] - 4s - loss: 0.3263 - acc: 0.9082 - val_loss: 0.3024 - val_acc: 0.9156
Epoch 5/10
60000/60000 [==============================] - 4s - loss: 0.3061 - acc: 0.9132 - val_loss: 0.2845 - val_acc: 0.9196

Notice the activation='linear' of the first Dense layer, which is the equivalent of not specifying anything, like in your case (as I said, practically everything you throw at MNIST will "work")...
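A consequence of leaving that first layer linear is that it adds no expressive power: a linear layer followed by another Dense layer computes the same pre-activations as a single Dense layer with merged weights. The NumPy sketch below uses random weights and invented variable names purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 784))  # a small batch of fake inputs

# Linear 784-unit layer followed by the weights of a 50-unit layer.
W1 = rng.normal(size=(784, 784)) * 0.01
b1 = rng.normal(size=784)
W2 = rng.normal(size=(784, 50)) * 0.01
b2 = rng.normal(size=50)

# Pre-activation of the 50-unit layer when both layers are applied in turn.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same quantity from one merged affine map.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

same = np.allclose(two_layers, one_layer)
print(same)
```

Since the ReLU of the 50-unit layer is applied to identical values in both cases, the linear 784-unit layer only adds parameters, not capacity.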

Final advice: Try modifying your model as:

model = tf.keras.Sequential()
model.add(layers.Dense(784, activation='relu', input_shape=(784,)))
model.add(layers.Dense(h1, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

in order to use the better (and default) 'glorot_uniform' initializer, and remove the kernel_regularizer args (they may be the cause of any issue - always start simple!)...

The reason I have MSE as my loss function is that when I used crossentropy, a lot of strange things occurred. I have another post with that information, but I didn't get any working answers. Here is the link to the post: https://stackoverflow.com/questions/55328966/tf-keras-loss-becomes-nan — Ronan Venkat, Mar 31 '19 at 22:44
@RonanVenkat MSE is meaningless for such classification problems, and that's really non-negotiable — desertnaut, Mar 31 '19 at 22:48
@RonanVenkat please try to reproduce a Keras MNIST example first using the built-in MNIST data! God knows if the obscure MNIST variant you use are already normalized or not (you don't normalize them)...! — desertnaut, Mar 31 '19 at 22:51
I checked and the data isn't automatically normalized, just so you know. — Ronan Venkat, Mar 31 '19 at 23:39
I know I can't compare losses, but I can compare accuracy, which is what I was doing. The problem is that the manual predictions are off, but the evaluation has high accuracy. MSE has a very good evaluation with the test set, but crossentropy has a very bad evaluation, no better than a random guess. — Ronan Venkat, Mar 31 '19 at 23:57
@RonanVenkat MSE here is **meaningless** - period. As for the individual predictions, you still don't show details after the corrective actions suggested - even before, did you actually show results of `print(y_test[1:])` instead of `print(y_test)`??? Hope you just have not just shifted the true labels by one position, and then worry why they do not agree with the predictions... — desertnaut, Apr 01 '19 at 00:06
I think that was probably the problem. I was printing only the first few digits of y_test, so I was usually doing something like y_test[1:20]. I should have realized the indexing problem earlier, sorry. But why is MSE meaningless for classification? I know the formula and the derivative and I don't see why it wouldn't work. And MSE does give a high accuracy, so why doesn't that matter? — Ronan Venkat, Apr 01 '19 at 00:21
@RonanVenkat that's why we ask for a [MCVE], and not verbal assurances like "*it doesn't look right*"; for the questions, see update... — desertnaut, Apr 01 '19 at 10:51
but the difference between linear and logistic regression is not only the cost, but also the hypothesis. I’ve seen a lot of Andrew Ng’s course, so I know some theory. And if crossentropy should work, why doesn’t it work at all here? — Ronan Venkat, Apr 01 '19 at 13:40
whenever I run with crossentropy, the accuracy just flutters around at ~10%. — Ronan Venkat, Apr 01 '19 at 13:48
@RonanVenkat sorry, cannot reproduce the behavior (it may be due to the regularizers); see last update & advice (my final word on the subject)... — desertnaut, Apr 01 '19 at 14:26
