If I understand the role of the loss function correctly, it directs training: the model is trained to minimize the loss value. So, for example, if I want my model to have the lowest possible mean absolute error, I should use MAE as the loss function. Why is it, then, that you sometimes see someone who wants the best possible accuracy build a model that minimizes a completely different function? For example:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')

How come the model above is trained to give us the best acc, when during its training it tries to minimize another function (MSE)? I know that, once the model is trained, the metric will report the best acc found during training.

My doubt is: shouldn't the focus of the model during its training to maximize acc (or minimize 1/acc) instead of minimizing MSE? If done that way, wouldn't the model give us even higher accuracy, since it knows it has to maximize it during its training?

desertnaut
daniellga
  • The metric `accuracy` can be thought of as `number_correct / total`. This *is* what you care about. In the end you want to get a high accuracy. But how do you get there? You can't backpropagate values for accuracy and update. What you can do, however, is use a loss function to minimize. As you minimize the loss you also increase accuracy. Think about what `sgd` does. What direction does it go? What does it do? It helps find the minimum. How so? There is a reason the loss functions are designed to be easy to differentiate. You might want to first better understand how ANNs work. – Chrispresso Jun 07 '19 at 17:13

1 Answer

To start with, the code snippet you have used as example:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')

is actually invalid (although Keras will not produce any error or warning) for a very simple and elementary reason: MSE is a valid loss for regression problems, and for such problems accuracy is meaningless (it is meaningful only for classification problems, where MSE is not a valid loss function). For details (including a code example), see my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)?; for a similar situation in scikit-learn, see my answer in this thread.
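To make the "accuracy is meaningless for regression" point concrete, here is a minimal pure-Python sketch. It assumes accuracy means "fraction of predictions exactly equal to the target"; the helper names `mse` and `exact_match_accuracy` are made up for illustration, and this is not a claim about how Keras computes its `acc` metric internally:

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions equal to the target (hypothetical helper)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Continuous regression targets (made-up numbers).
y_true = [1.50, 2.25, 3.10, 4.75]
good_pred = [1.51, 2.24, 3.11, 4.74]  # every prediction is off by only 0.01
bad_pred = [9.00, 9.00, 9.00, 9.00]   # every prediction is far off

# MSE separates the two models clearly; accuracy cannot tell them apart,
# because a continuous prediction almost never hits the target exactly.
print("good:", mse(y_true, good_pred), exact_match_accuracy(y_true, good_pred))
print("bad: ", mse(y_true, bad_pred), exact_match_accuracy(y_true, bad_pred))
```

Both models get the same accuracy here, while their MSE values differ by several orders of magnitude, which is why the loss itself is the sensible thing to report in regression.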

Moving on to your general question: in regression settings we usually don't need a separate performance metric, and we normally use the loss function itself for this purpose, i.e. the correct code for the example you have used would simply be

model.compile(loss='mean_squared_error', optimizer='sgd')

without any metrics specified. We could of course use metrics='mse', but this is redundant and not really needed. Sometimes people use something like

model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['mse','mae'])

i.e. optimise the model according to the MSE loss, but also report its performance in terms of the mean absolute error (MAE), in addition to the MSE.
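The split between "what is optimised" and "what is merely reported" can be sketched without Keras. The following is an illustrative pure-Python example, not Keras internals: it fits a one-variable linear model with plain full-batch gradient descent (rather than true SGD, to keep it short) on made-up data, using the MSE gradient for the updates and recording MAE only for monitoring. All names (`mse`, `mae`, `history`, the learning rate) are assumptions of the sketch:

```python
def mse(y_true, y_pred):
    """Mean squared error: the loss that drives the updates below."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: only reported, never used for the updates."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0   # model: y_hat = w * x + b
lr = 0.05         # learning rate (assumed value)
history = []      # (mse, mae) per epoch, analogous to loss plus metrics

for epoch in range(500):
    preds = [w * x + b for x in xs]
    history.append((mse(ys, preds), mae(ys, preds)))

    # Gradient of the MSE loss with respect to w and b.
    n = len(xs)
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n

    w -= lr * grad_w
    b -= lr * grad_b

print("first epoch (mse, mae):", history[0])
print("last epoch  (mse, mae):", history[-1])
print("learned w, b:", w, b)
```

Only the MSE gradient touches `w` and `b`; the MAE column is there purely so a human can read the error in the original units.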

Now, your question:

shouldn't the focus of the model during its training to maximize acc (or minimize 1/acc) instead of minimizing MSE?

is indeed valid, at least in principle (save for the reference to MSE), but only for classification problems, where, roughly speaking, the situation is as follows: we cannot use the vast arsenal of gradient-based optimization methods to maximize the accuracy directly, because accuracy is not a differentiable function of the model parameters. It is piecewise constant (a small change in the weights usually flips no prediction at all), so its gradient is zero almost everywhere and gives the optimizer nothing to follow. So, we need a differentiable proxy function to use as the loss. The most common example of such a loss function suitable for classification problems is the cross entropy.
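A minimal sketch of this point, in pure Python and under stated assumptions: a one-parameter logistic model `p = sigmoid(w * x)` on four made-up samples, with predictions thresholded at 0.5 (the function names and the data are illustrative only). Nudging `w` by a small amount leaves the accuracy unchanged, while the cross entropy moves:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(w, xs, labels):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(
        1 for x, y in zip(xs, labels) if (sigmoid(w * x) >= 0.5) == (y == 1)
    )
    return correct / len(xs)


def cross_entropy(w, xs, labels):
    """Mean binary cross entropy of the model p = sigmoid(w * x)."""
    total = 0.0
    for x, y in zip(xs, labels):
        p = sigmoid(w * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


# Made-up, linearly separable toy data.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

w, eps = 0.5, 1e-3  # current weight and a small perturbation

# Finite differences: how much does each quantity react to a tiny step in w?
acc_change = accuracy(w + eps, xs, labels) - accuracy(w, xs, labels)
ce_change = cross_entropy(w + eps, xs, labels) - cross_entropy(w, xs, labels)

print("change in accuracy:     ", acc_change)  # no signal for the optimizer
print("change in cross entropy:", ce_change)   # a usable descent direction
```

The accuracy is flat around `w`, so a gradient-based optimizer would get no hint about which way to move; the cross entropy decreases as `w` grows here, which is exactly the signal the optimizer needs.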

Rather unsurprisingly, this question of yours pops up from time to time, albeit in slight variations in context; see, for example, my other answers on the topic.

For the interplay between loss and accuracy in the special case of binary classification, you may find my answers in the following threads useful:

desertnaut
  • Why is accuracy not a differentiable function, though? Is it only because of the `argmax` in `number_correct`? If so, it might be possible to use a smoother sampler to make it differentiable. – liang Oct 18 '21 at 04:01