From deeplearning.ai:
The general methodology to build a Neural Network is to:
- Define the neural network structure (# of input units, # of hidden units, etc.).
- Initialize the model's parameters
- Loop:
  - Implement forward propagation
  - Compute loss
  - Implement backward propagation to get the gradients
  - Update parameters (gradient descent)
How does the loss function impact how the network learns?
For example, here is my implementation of forward and back propagation, which I believe is correct since I can train a model with the code below and achieve acceptable results:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(a):
    # derivative of the sigmoid, expressed in terms of its output a = sigmoid(z)
    return a * (1.0 - a)

m = X_train.shape[1]  # number of training examples; X_train has shape (features, m)

for i in range(number_iterations):
    # forward propagation
    Z1 = np.dot(weight_layer_1, X_train) + bias_1
    a_1 = sigmoid(Z1)
    Z2 = np.dot(weight_layer_2, a_1) + bias_2
    a_2 = sigmoid(Z2)

    # costs (computed only to monitor training)
    mse_cost = np.sum((a_2 - Y_train) ** 2) / m
    cost_cross_entropy = -(1.0 / m) * (np.dot(np.log(a_2), Y_train.T) +
                                       np.dot(np.log(1 - a_2), (1 - Y_train).T))

    # back propagation
    d_Z2 = np.multiply((a_2 - Y_train), d_sigmoid(a_2))
    d_weight_2 = np.dot(d_Z2, a_1.T)
    d_bias_2 = np.sum(d_Z2, axis=1, keepdims=True)
    d_a_1 = np.dot(weight_layer_2.T, d_Z2)
    d_Z1 = np.multiply(d_a_1, d_sigmoid(a_1))
    d_weight_1 = np.dot(d_Z1, X_train.T)
    d_bias_1 = np.sum(d_Z1, axis=1, keepdims=True)

    # perform a parameter update in the negative gradient direction to decrease the loss
    weight_layer_2 = weight_layer_2 - learning_rate * d_weight_2
    bias_2 = bias_2 - learning_rate * d_bias_2
    weight_layer_1 = weight_layer_1 - learning_rate * d_weight_1
    bias_1 = bias_1 - learning_rate * d_bias_1
Note the lines:
mse_cost = np.sum((a_2 - Y_train) ** 2) / m
cost_cross_entropy = -(1.0 / m) * (np.dot(np.log(a_2), Y_train.T) + np.dot(np.log(1 - a_2), (1 - Y_train).T))
I can use either the MSE loss or the cross-entropy loss to report how well the system is learning, but this appears to be for informational purposes only: the choice of cost function does not change how the network learns, because the backward pass above never reads either cost value. I believe I'm missing something fundamental, since the deep learning literature often states that choosing the loss function is an important step. Yet, as shown in my code above, I can report either cross-entropy or MSE loss and it does not affect how the network learns. Is the loss value really only informational?
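For comparison, here is how I would expect the very first backward-propagation step to differ if the gradient were actually derived from each loss. This is only a sketch of my current understanding (not course code), reusing a_2, Y_train and d_sigmoid from the code above:

# Sketch only (my understanding, not course code); reuses a_2, Y_train, d_sigmoid from above.

# Squared-error loss, e.g. (1/2) * (a_2 - y)^2 per example:
# dL/dZ2 = (a_2 - Y_train) * sigmoid'(Z2)
d_Z2_mse = (a_2 - Y_train) * d_sigmoid(a_2)

# Binary cross-entropy with a sigmoid output: the sigmoid' factor cancels,
# leaving dL/dZ2 = a_2 - Y_train
d_Z2_cross_entropy = a_2 - Y_train

# Everything downstream (d_weight_2, d_Z1, d_weight_1, ...) would then differ,
# so the two losses would produce different parameter updates.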
Update:
For example, here is a snippet of code from deeplearning.ai that computes the cost:
# GRADED FUNCTION: compute_cost

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given equation (13)
    """

    m = Y.shape[1]  # number of examples

    # Retrieve W1 and W2 from parameters
    ### START CODE HERE ### (≈ 2 lines of code)
    W1 = parameters['W1']
    W2 = parameters['W2']
    ### END CODE HERE ###

    # Compute the cross-entropy cost
    ### START CODE HERE ### (≈ 2 lines of code)
    logprobs = np.multiply(np.log(A2), Y) + np.multiply((1 - Y), np.log(1 - A2))
    cost = - np.sum(logprobs) / m
    ### END CODE HERE ###

    cost = np.squeeze(cost)  # makes sure cost is the dimension we expect.
                             # E.g., turns [[17]] into 17
    assert(isinstance(cost, float))

    return cost
This code runs as expected and achieves high accuracy / low cost. However, the value of the cost is not used anywhere else in the implementation; it only tells the machine learning engineer how well the network is learning. This is what makes me question it: how does the choice of cost function actually impact how the neural network learns?
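To make the question concrete, this is roughly how I understand the cost being used in the surrounding training loop; forward_propagation, backward_propagation and update_parameters below are just how I picture those steps (placeholder names, not the graded code):

# Sketch of my understanding of the surrounding loop (placeholder helpers, not graded code):
for i in range(num_iterations):
    A2, cache = forward_propagation(X, parameters)

    # The cost is computed here, but its value is only printed / logged ...
    cost = compute_cost(A2, Y, parameters)

    # ... while the gradients are computed directly from A2, Y and the cached
    # activations, without ever reading the variable `cost`.
    grads = backward_propagation(parameters, cache, X, Y)
    parameters = update_parameters(parameters, grads, learning_rate)

    if i % 1000 == 0:
        print("Cost after iteration %i: %f" % (i, cost))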