Cost function training target versus accuracy desired goal

Question

When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.

Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.

But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function, has a classification accuracy worse than a simple hand-coded threshold comparison.

I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable) assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value, generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.

How can we train a neural network so that it ends up maximizing classification accuracy?

import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)

# Parameters
epochs = 10000
learning_rate = 0.01

# Data
train_X = [
    [0],
    [0],
    [2],
    [2],
    [9],
]
train_Y = [
    0,
    0,
    1,
    1,
    0,
]

rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]

# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)

# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))

# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()

# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
            ))
    optimizer.run({X: train_X, Y: train_Y})

# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))

# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))

Just to be sure I understand your question: are you really asking, here at SO, *how can we* [escape from the whole edifice of the mathematical optimization of continuous functions, so that we can optimize directly the accuracy (preferably demonstrating this in a short Tensorflow script)]? Or you are really asking for an explanation/intuition on *why can't we* [optimize directly the accuracy, instead of proxies such as the loss]? Admittedly, and given the forum (SO), the first question would be absurd, while the second is arguably a legitimate one... — desertnaut, Dec 19 '17 at 22:19
@desertnaut I'm asking for a way to get a continuous proxy function that's closer to the accuracy, or an explanation of why there isn't one, or an explanation of why I'm overlooking something I haven't thought of that makes the problem go away, without prejudice regarding which of those is more likely. — rwallace, Dec 19 '17 at 22:25

desertnaut · Accepted Answer · 2021-06-08T13:35:22.050

How can we train a neural network so that it ends up maximizing classification accuracy?

I'm asking for a way to get a continuous proxy function that's closer to the accuracy

To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them, but it goes back several decades, and it actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:

The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.

It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.

Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...

That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to if a better proposal for the loss may be just around the corner.

There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...

All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).

Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such as if p[i] > 0.5, then class[i] = "1". Now, we can find many cases whet this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Enlarging somewhat an already broad discussion: Can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descend?

Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.

Bar · Answer 2 · 2017-12-20T12:12:22.073

I think you are forgetting to pass your output through a simgoid. Fixed below:

import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)

# Parameters
epochs = 10000
learning_rate = 0.01

# Data
train_X = [
    [0],
    [0],
    [2],
    [2],
    [9],
]
train_Y = [
    0,
    0,
    1,
    1,
    0,
]

rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]

# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)

# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))

# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()

# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
            ))
    optimizer.run({X: train_X, Y: train_Y})

# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))

# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))

The output:

0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy

Thanks! It does seem reasonable that the sigmoid might help. When I try your code, it still doesn't work, but I think that's because your TF is using a different random number sequence. When I try your starting W/b, it does work... — rwallace, Dec 19 '17 at 19:23
With my starting W/b it still doesn't work, even with the same code, it reaches a different endpoint, but that's odd, perceptrons are supposed to always converge to the global optimum, they aren't supposed to have local optima. Still trying to figure out what's going on with that. — rwallace, Dec 19 '17 at 19:25
Okay, I remembered 'perceptrons always converge to global optimum' but forgot the second half of the correct version 'if the data is linearly separable'. So the global optimum does indeed score 4/5 with the sigmoid. — rwallace, Dec 20 '17 at 21:45
That is correct, your problem cannot be perfectly solved by single layer network (look at the XOR problem http://home.agh.edu.pl/~vlsi/AI/xor_t/en/main.htm). Try adding a second layer and see what happens though. — Bar, Dec 21 '17 at 16:33

Cost function training target versus accuracy desired goal

2 Answers2

Linked

Related