MNIST Classification: mean_squared_error loss function and tanh activation function

Question

I modified the TensorFlow getting-started example as follows:

import tensorflow as tf
from sklearn.metrics import roc_auc_score
import numpy as np
import commons as cm
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.tanh),
  # tf.keras.layers.Dense(512, activation=tf.nn.tanh),
  # tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.tanh)
])
model.compile(optimizer='adam',
              loss='mean_squared_error',
              # loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = cm.Histories()
h = model.fit(x_train, y_train, epochs=50, callbacks=[history])
print("history:", history.losses)
cm.plot_history(h)
# cm.plot(history.losses, history.aucs)


test_predictions = model.predict(x_test)


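# Compute confusion matrix
pred = np.argmax(test_predictions, axis=1)
# Note: predict_classes was removed from Sequential in TensorFlow 2.6;
# np.argmax(model.predict(x_test), axis=1) is the equivalent for this model.
pred2 = model.predict_classes(x_test)
confusion = confusion_matrix(y_test, pred)
cm.draw_confusion(confusion, range(10))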

With its default parameters:

relu activation at hidden layers,
softmax at the output layer and
sparse_categorical_crossentropy as loss function,

it works fine, and the prediction accuracy for every digit is above 99%.

However, with my parameters (tanh activation function and mean_squared_error loss function), it just predicts 0 for all test samples.

What is the problem? The accuracy increases with each epoch and reaches 99%, while the loss is about 20.
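
A back-of-the-envelope check of the reported loss, as a sketch under assumptions not stated in the question: Keras broadcasts the integer labels (0 to 9) against the ten `tanh` outputs, the digits are roughly equally frequent, and the network has saturated every output at +1, the closest value `tanh` can get to most labels. The names `labels`, `saturated`, and `floor_loss` are illustrative.

```python
import numpy as np

# Illustrative stand-in for the labels: every digit 0..9 equally often (assumption).
labels = np.repeat(np.arange(10), 100).astype(float)

# tanh outputs are bounded by 1; here all ten units are assumed saturated at +1.
saturated = np.ones((labels.size, 10))

# MSE with the label column broadcast across the ten outputs, which is what
# comparing an integer label vector with an (N, 10) prediction amounts to.
floor_loss = np.mean((labels[:, None] - saturated) ** 2)

# All ten values in a row are equal, so argmax returns the first index, 0.
predicted = np.argmax(saturated, axis=1)

print(floor_loss)
print(np.unique(predicted))
```

Under these assumptions the loss lands close to the reported value of about 20, and `argmax` picks class 0 for every sample, which is consistent with the symptoms described above.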

MSE is not an appropriate loss function for classification problems, as in your case; you may find this thread useful: [What function defines accuracy in Keras when the loss is mean squared error (MSE)?](https://stackoverflow.com/questions/48775305/what-function-defines-accuracy-in-keras-when-the-loss-is-mean-squared-error-mse/48788577#48788577) — desertnaut, Nov 19 '18 at 11:04
My output target variable Y consists of floating-point values between -1 and 1, so I would like to use the 'tanh' activation function at the last layer of my Keras deep learning model. Which loss function is preferred in this case? – Anurag Gupta, Feb 15 '19 at 10:57

Matthieu Brucher · Accepted Answer · 2018-11-19T12:11:20.210


You need to use a loss function that fits your data. Here you have a categorical output, so you need to use `sparse_categorical_crossentropy`, and also set `from_logits=True` with no activation on the last layer.
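
As a rough NumPy sketch of what that combination computes (not Keras's actual implementation): the loss takes integer class labels and unbounded scores ("logits") and applies the softmax itself. The helper name `sparse_ce_from_logits` and the sample values are made up for illustration; in `tf.keras` the corresponding pieces would be `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` and a final `Dense(10)` layer with no activation.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """Mean cross-entropy for integer labels and raw, unbounded scores."""
    # Subtract the row-wise maximum so the exponentials cannot overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax: the log of each class probability.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select the log-probability of the true class in each row.
    picked = log_probs[np.arange(labels.size), labels]
    return -picked.mean()

labels = np.array([3, 7])                    # integer targets, as in y_train
confident = np.zeros((2, 10))
confident[0, 3] = confident[1, 7] = 20.0     # large score on the true class
uniform = np.zeros((2, 10))                  # same score for every class

print(sparse_ce_from_logits(labels, confident))  # very close to 0
print(sparse_ce_from_logits(labels, uniform))    # equals log(10)
```

Because the scores are unbounded, the network can make the loss as small as it likes by widening the gap to the true class, which a bounded `tanh` output cannot do.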

If you need to use tanh as your output, then you can use MSE with a one-hot encoded version of your labels, rescaled to the tanh range (-1 for the wrong classes and +1 for the correct one).
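
A minimal sketch of that label preparation, assuming the targets should be +1 for the true class and -1 elsewhere so that they sit at the ends of the `tanh` range. The helper names `to_bipolar_one_hot` and `decode` are hypothetical; in Keras, `tf.keras.utils.to_categorical` would produce the 0/1 matrix that is rescaled here.

```python
import numpy as np

def to_bipolar_one_hot(labels, num_classes=10):
    """Encode integer labels as rows of -1 with +1 at the true class."""
    one_hot = np.eye(num_classes)[labels]   # 0/1 matrix, shape (N, num_classes)
    return 2.0 * one_hot - 1.0              # rescale {0, 1} to {-1, +1}

def decode(outputs):
    """Map network outputs back to class indices: the largest unit wins."""
    return np.argmax(outputs, axis=1)

labels = np.array([0, 4, 9])
targets = to_bipolar_one_hot(labels)

# Stand-in for the tanh outputs of a trained network: close to the targets.
outputs = np.tanh(2.0 * targets)

mse = np.mean((outputs - targets) ** 2)
print(targets[1])
print(mse)
print(decode(outputs))
```

In the question's setup, the encoded targets would be passed to `model.fit` in place of `y_train`, and the decoded predictions compared with the integer `y_test`.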

edited Nov 19 '18 at 12:11

answered Nov 19 '18 at 10:35


Thanks, but I had to use those functions and measure their performance. I think my mistake is that I should evaluate the categorical output in another way. – Ahmad Nov 19 '18 at 11:37
Using `tanh` for a logits output doesn't make sense (it's not between 0 and 1, and the cost functions expect unbounded values). What do you mean by "had to use these functions"? If you want to use MSE, use a sigmoid output, clamp the categories at (1e-7, 1-1e-7) to avoid divergence, and try again. But be aware that the results won't sum to one anymore. – Matthieu Brucher Nov 19 '18 at 11:40
It's an assignment and those things are in the assignment definition, so I can't use other methods unless they are equivalent to what I do. I think I reached a solution. – Ahmad Nov 19 '18 at 11:43
Change class then? Seems like this doesn't teach you the right practices. – Matthieu Brucher Nov 19 '18 at 11:45
I guess I should convert the output to a matrix using `keras.utils.to_categorical`. However, I am not sure whether it can produce bipolar values for the matrix. – Ahmad Nov 19 '18 at 11:47
It encodes values as 0 and 1. You then still need to change the output activation function: tanh DOESN'T work for target values of 0 and 1. – Matthieu Brucher Nov 19 '18 at 11:59
Right! I just asked https://stackoverflow.com/questions/53374154/how-to-convert-a-binary-matrix-to-a-bipolar-one-in-python to know of an easy way to do that. – Ahmad Nov 19 '18 at 12:01
I converted the matrix as you said, and now the classification works with `tanh`! – Ahmad Nov 19 '18 at 12:08
OK, let me modify the answer a little bit then. – Matthieu Brucher Nov 19 '18 at 12:09
If I rescale the training data with `x_train, x_test = x_train / 255.0, x_test / 255.0`, it doesn't work! Do you know why? – Ahmad Nov 19 '18 at 14:22
You said it worked? And you have the rescaling in your code here? Did you rescale twice? – Matthieu Brucher Nov 19 '18 at 14:23
Sorry, I tested it with a related tutorial that is similar to the solution above. Later I found that its author did not rescale the inputs. – Ahmad Nov 19 '18 at 14:28
Take the code above, convert the output to categorical, and remove the input rescaling, and it works. But why? – Ahmad Nov 19 '18 at 14:32
That actually may not matter; you are not in an autoencoder. The different scale may be required to get a good gradient at the beginning; it all depends on the initialization. – Matthieu Brucher Nov 19 '18 at 14:33
:( But it's again among the conditions of the exercise. It may seem to work because the accuracy is high, but the confusion matrix shows low and messed-up results. – Ahmad Nov 19 '18 at 14:34
The scaling of the inputs is linked to the scaling of the initial weights of the layer: they are scaled inversely to one another. There is no practice that forbids you from scaling or not scaling them for a classification network. – Matthieu Brucher Nov 19 '18 at 14:42
Oops, I found the error! If you check my code, I checked the `test_predictions` in two ways; under my new conditions I must use `pred2`. – Ahmad Nov 19 '18 at 14:44
My output target variable Y consists of floating-point values between -1 and 1, so I would like to use the 'tanh' activation function at the last layer of my Keras deep learning model. Which loss function is preferred in this case? – Anurag Gupta Feb 15 '19 at 10:58
@AnuragGupta This would be a new question, and better on datascience.SE or stats.SE. – Matthieu Brucher Feb 15 '19 at 10:59
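
The remark in the comments that input scaling and initial weight scaling offset each other can be checked with a small NumPy sketch. The layer sizes, the random initialization, and the names `x_raw`, `w`, and `b` are illustrative assumptions and do not reproduce Keras's actual initializer.

```python
import numpy as np

rng = np.random.default_rng(0)

x_raw = rng.integers(0, 256, size=(5, 784)).astype(float)  # pixel values 0..255
w = rng.normal(0.0, 0.05, size=(784, 512))                  # illustrative weights
b = np.zeros(512)

# Scaling the inputs by 1/255 gives the same pre-activations as
# leaving the inputs alone and scaling the weights by 1/255.
scaled_inputs = (x_raw / 255.0) @ w + b
scaled_weights = x_raw @ (w / 255.0) + b
print(np.allclose(scaled_inputs, scaled_weights))

# With the same weights, raw pixels push tanh much further into saturation,
# where its gradient is close to zero.
saturated_raw = np.mean(np.abs(np.tanh(x_raw @ w + b)) > 0.99)
saturated_scaled = np.mean(np.abs(np.tanh(scaled_inputs)) > 0.99)
print(saturated_raw, saturated_scaled)
```

Whether rescaling helps or hurts therefore depends on how the first layer happens to be initialized, which fits the observation in the thread that the result changed when the rescaling was removed.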
