
From deeplearning.ai:

The general methodology to build a Neural Network is to:

  1. Define the neural network structure (# of input units, # of hidden units, etc.).
  2. Initialize the model's parameters.
  3. Loop:
    • Implement forward propagation
    • Compute loss
    • Implement backward propagation to get the gradients
    • Update parameters (gradient descent)

How does the loss function impact how the network learns?

For example, here is my implementation of forward and back propagation, which I believe is correct since I can train a model with the code below and achieve acceptable results:

# sigmoid, d_sigmoid, the parameters and the training data are defined above
m = Y_train.shape[1]  # number of training examples

for i in range(number_iterations):

    # Forward propagation
    Z1 = np.dot(weight_layer_1, X_train) + bias_1
    a_1 = sigmoid(Z1)

    Z2 = np.dot(weight_layer_2, a_1) + bias_2
    a_2 = sigmoid(Z2)

    # Cost: either value can be reported
    mse_cost = np.sum((a_2 - Y_train) ** 2) / (2 * m)
    cost_cross_entropy = -(1.0 / m) * (np.dot(np.log(a_2), Y_train.T)
                                       + np.dot(np.log(1 - a_2), (1 - Y_train).T))

    # Back propagation (gradients computed before any parameters change)
    d_Z2 = np.multiply((a_2 - Y_train), d_sigmoid(a_2))
    d_weight_2 = np.dot(d_Z2, a_1.T)
    d_bias_2 = np.sum(d_Z2, axis=1, keepdims=True)

    d_a_1 = np.dot(weight_layer_2.T, d_Z2)
    d_Z1 = np.multiply(d_a_1, d_sigmoid(a_1))
    d_weight_1 = np.dot(d_Z1, X_train.T)
    d_bias_1 = np.sum(d_Z1, axis=1, keepdims=True)

    # Parameter updates in the negative gradient direction to decrease the loss
    weight_layer_2 = weight_layer_2 - learning_rate * d_weight_2
    bias_2 = bias_2 - learning_rate * d_bias_2
    weight_layer_1 = weight_layer_1 - learning_rate * d_weight_1
    bias_1 = bias_1 - learning_rate * d_bias_1

Note the lines:

mse_cost = np.sum((a_2 - Y_train) ** 2) / (2 * m)
cost_cross_entropy = -(1.0 / m) * (np.dot(np.log(a_2), Y_train.T)
                                   + np.dot(np.log(1 - a_2), (1 - Y_train).T))

I can use either the MSE loss or the cross-entropy loss to report how well the system is learning. But this is for informational purposes only; the choice of cost function does not seem to impact how the network learns. I believe I'm not understanding something fundamental, since the deep learning literature often states that the choice of loss function is an important step in deep learning. Yet, as shown in my code above, I can choose cross-entropy or MSE loss and it makes no difference to how the network learns. Is the cross-entropy or MSE loss really for informational purposes only?

Update:

For example, here is a snippet of code from deeplearning.ai that computes the cost:

# GRADED FUNCTION: compute_cost

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given equation (13)
    """

    m = Y.shape[1] # number of examples

    # Retrieve W1 and W2 from parameters
    ### START CODE HERE ### (≈ 2 lines of code)
    W1 = parameters['W1']
    W2 = parameters['W2']
    ### END CODE HERE ###

    # Compute the cross-entropy cost
    ### START CODE HERE ### (≈ 2 lines of code)
    logprobs = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(logprobs) / m
    ### END CODE HERE ###

    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))

    return cost
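
For example, calling it on toy predictions (my own sanity check, not part of the course code) behaves as expected:

import numpy as np

Y = np.array([[1, 0, 1, 1]])                 # true labels, shape (1, 4)
A2_good = np.array([[0.9, 0.1, 0.8, 0.95]])  # confident, mostly-correct predictions
A2_bad = np.array([[0.5, 0.5, 0.5, 0.5]])    # uninformative predictions
params = {'W1': None, 'W2': None}            # retrieved by compute_cost but never used

print(compute_cost(A2_good, Y, params))      # ~0.12 (low cost)
print(compute_cost(A2_bad, Y, params))       # ~0.69, i.e. ln(2) (high cost)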

This code runs as expected and achieves high accuracy / low cost. The value of the cost is not used in this implementation other than to offer information to the machine learning engineer as to how well the network is learning. This leads me to question how the choice of cost function impacts how the neural network learns.

  • I'm voting to close this question as off-topic because it is about the theory of artificial neural networks. But, as a short answer: the loss function is a very important factor in how (and whether) the network learns. I really liked this tutorial: http://neuralnetworksanddeeplearning.com/ – Framester Jul 19 '18 at 16:00
  • @Framester I also like this tutorial and think the answer lies in http://neuralnetworksanddeeplearning.com/chap3.html . Perhaps my misunderstanding is that if I change the cost function, the activation function must also change? In my example above, changing the loss function has no impact because I am not changing the activation function as well. Is the gradient of the loss function equal to the gradient of the sigmoid function? – blue-sky Jul 19 '18 at 16:08
  • Do you understand the mechanics of the loss function in general: how it affects the parameter updates? I read your question as asking about the choice of loss function, rather than the effect of *any* loss function. – Prune Jul 19 '18 at 23:05
  • @Prune I understand that the cost function measures how well the network is training, but I don't understand how it affects the parameter updates. As in my original question: if I use the MSE cost instead of cross-entropy, it has no impact on how the network learns. Perhaps the cost function choice impacts the choice of activation function? In other words, if I change the cost function, does another part of my network outlined in the question above also need to change in order to incorporate the change in cost function? Also, I've updated the question. Thanks. – blue-sky Jul 20 '18 at 09:52
  • Thanks for the clarification; I see it got you an answer even more complete than the one I would have given. – Prune Jul 20 '18 at 16:09

1 Answer


Well, this is just a rough high-level attempt to answer what is probably an off-topic question for SO (as I understand your puzzlement in principle).

The value of the cost is not used in this implementation other than to offer information to the machine learning engineer as to how well the network is learning.

This is actually correct; reading closely Andrew Ng's Jupyter notebook for the compute_cost function you have posted, you'll see:

5 - Cost function

Now you will implement forward and backward propagation. You need to compute the cost, because you want to check if your model is actually learning.

Literally, this is the only reason to explicitly compute the actual value of the cost function in your code.
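
For example, the cost value typically appears only in a progress printout. Here is a minimal sketch of such a training loop (I am assuming helper names along the lines of the course notebook; this is an illustration, not the graded code):

for i in range(num_iterations):
    A2, cache = forward_propagation(X, parameters)         # forward pass
    cost = compute_cost(A2, Y, parameters)                 # cost value computed here...
    grads = backward_propagation(parameters, cache, X, Y)  # gradients never read `cost`
    parameters = update_parameters(parameters, grads)
    if i % 1000 == 0:
        print("Cost after iteration %i: %f" % (i, cost))   # ...and only reported here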

But this is for informational purposes only; the choice of cost function does not seem to impact how the network learns.

Not so fast! Here is the (often invisible) catch:

The choice of the cost function is what determines the exact equations used for computing the dw and db quantities, hence the learning procedure.

Notice that here I am talking about the function itself, not its values.

In other words, calculations like your

d_weight_2 = np.dot(d_Z2, a_1.T)

and

d_weight_1 = np.dot(d_Z1, X_train.T)

have not fallen from the sky; they are the outcome of the back-propagation math applied to the specific cost function.
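
To see this concretely, compare the output-layer error term that the math produces for two different losses, both with a sigmoid output (a minimal sketch, reusing your own variable names):

# MSE loss: the chain rule keeps the sigmoid' factor, which is
# effectively what your d_Z2 line implements
d_Z2_mse = np.multiply((a_2 - Y_train), d_sigmoid(a_2))

# Binary cross-entropy loss: the sigmoid' factor cancels analytically
# (dL/dZ2 = a_2 - y), leaving simply
d_Z2_ce = a_2 - Y_train

# Everything downstream (d_weight_2, d_Z1, d_weight_1, ...) is computed
# from d_Z2, so a different cost function yields different updates

Same network, same data: a different cost function gives a different d_Z2, hence different gradients and a different learning trajectory.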

Here are some relevant high-level slides from Andrew's introductory course at Coursera:

[slide images not reproduced here]

Hope this helps; the specifics of how exactly we arrive at the particular form of the calculations for dw and db, starting from the derivative of the cost function, go beyond the scope of this post, but you can find several good tutorials on back-propagation online (here is one).

Finally, for a (very) high level description of what can happen when we choose the wrong cost function (binary cross-entropy for multi-class classification, instead of the correct categorical cross-entropy), you can have a look at my answer at Keras binary_crossentropy vs categorical_crossentropy performance?.
