A)
In supervised learning tasks, the overall optimization objective is the loss summed over all training examples, defined as E = \sum_n loss(y_n, t_n), where n indexes the training examples, y_n is the network output for training example n, t_n is the label of training example n, and loss is the loss function. Note that y_n and t_n are in general vector-valued quantities---the vector length is determined by the number of output neurons in the network.
One possible choice for the loss function is the squared error, defined as loss(y, t) = \sum_k (y_k - t_k) ^ 2, where k indexes the output neurons of the network. In backpropagation, one has to compute the partial derivative of the overall optimization objective with respect to the network parameters---which are synaptic weights and neuron biases. This is achieved through the following formula according to the chain rule:
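As a concrete illustration, here is a minimal NumPy sketch of the squared-error loss and the summed objective E (the function names are my own, chosen for this example):

```python
import numpy as np

def squared_error(y, t):
    # loss(y, t) = sum_k (y_k - t_k)^2, summed over the output neurons k
    return np.sum((y - t) ** 2)

def total_objective(ys, ts):
    # E = sum_n loss(y_n, t_n), summed over the training examples n
    return sum(squared_error(y, t) for y, t in zip(ys, ts))

y = np.array([0.2, 0.9])   # network output for one training example
t = np.array([0.0, 1.0])   # corresponding label
print(squared_error(y, t)) # 0.2^2 + (-0.1)^2, i.e. approximately 0.05
```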
(\partial E / \partial w_{ij}) = (\partial E / \partial out_j) * (\partial out_j / \partial in_j) * (\partial in_j / \partial w_{ij}),
where w_{ij} is the weight of the connection from neuron i to neuron j, out_j is the output of neuron j, and in_j is the input to neuron j.
How to compute the neuron output out_j and its derivative with respect to the neuronal input in_j depends on which activation function is used. If you use a linear activation function to compute a neuron's output out_j, the term (\partial out_j / \partial in_j) becomes 1. If you use, for example, the logistic function as activation function, the term (\partial out_j / \partial in_j) becomes sig(in_j) * (1 - sig(in_j)), where sig is the logistic function.
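To make the activation-function derivative concrete, here is a small Python sketch of the logistic function and its derivative, checked against a finite-difference approximation (the function names are mine):

```python
import math

def sigmoid(x):
    # logistic function: sig(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # d sig / d x = sig(x) * (1 - sig(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Numerical sanity check via a central finite difference.
x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert abs(numeric - sigmoid_deriv(x)) < 1e-9
```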
B)
In resilient backpropagation, biases are updated exactly the same way as weights---based on the sign of partial derivatives and individual adjustable step sizes.
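A minimal sketch of such a sign-based update for one parameter (a weight or a bias alike); the step-size factors and bounds below are typical values I am assuming, and the full Rprop algorithm additionally reverts the previous step when the gradient sign flips:

```python
import math

def rprop_step(grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    # Adapt the individual step size from the signs of two successive gradients.
    if grad * prev_grad > 0:      # same sign: take bigger steps
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:    # sign flipped: we overshot, shrink the step
        step = max(step * eta_minus, step_min)
    # The parameter change depends only on the sign of the gradient.
    delta = -math.copysign(step, grad) if grad != 0 else 0.0
    return delta, step

delta, step = rprop_step(grad=0.3, prev_grad=0.1, step=0.1)
# same gradient sign: the step grows to 0.12 and the parameter moves by -0.12
```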
C)
I am not quite sure if I understand correctly. The overall optimization objective is a scalar function of all network parameters, no matter how many output neurons there are. So there should be no confusion regarding how to compute partial derivatives here.
In general, in order to compute the partial derivative (\partial E / \partial w_{ij}) of the overall optimization objective E with respect to some weight w_{ij}, one has to compute the partial derivative (\partial out_k / \partial w_{ij}) of the output of each output neuron k with respect to w_{ij}:
(\partial E / \partial w_{ij}) = \sum_k (\partial E / \partial out_k) * (\partial out_k / \partial w_{ij}).
Note however that the partial derivative (\partial out_k / \partial w_{ij}) of the output neuron k with respect to w_{ij} will be zero if w_{ij} does not impact the output out_k of output neuron k.
One more thing. In case one uses the squared error as loss function, the partial derivative (\partial E / \partial out_k) of the overall optimization objective E with respect to the output out_k of some output neuron k is
(\partial E / \partial out_k) = 2 * (out_k - t_k),
where the quantity (out_k - t_k) is referred to as the error attached to output unit k, and where I assume a single training example with label t for notational convenience. Note that if w_{ij} does not have any impact on the output out_k of output neuron k, then the update of w_{ij} will not depend on the error (out_k - t_k), because (\partial out_k / \partial w_{ij}) = 0 as mentioned above.
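This relationship is easy to verify numerically; the sketch below perturbs one output neuron and compares the finite-difference slope of E with 2 * (out_k - t_k) (the variable names are mine):

```python
import numpy as np

def E(out, t):
    # squared-error objective for a single training example
    return np.sum((out - t) ** 2)

out = np.array([0.3, 0.8])  # outputs of the two output neurons
t = np.array([0.0, 1.0])    # label of the single training example

k, eps = 0, 1e-6
bumped = out.copy()
bumped[k] += eps
numeric = (E(bumped, t) - E(out, t)) / eps   # finite-difference slope
analytic = 2 * (out[k] - t[k])               # 2 * (out_k - t_k)
assert abs(numeric - analytic) < 1e-4
```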
A final remark to avoid any confusion: y_k and out_k both refer to the output of output neuron k in the network.