
I'm having trouble understanding how to implement backward propagation for Leaky ReLU.

I have read other posts, but I'm still not sure I understand them because the notation isn't defined (I'm not sure what is what).

If I have dA (the gradient with respect to the activation of the current layer) and a cached value Z from forward propagation, is this the correct implementation?

import numpy as np

def leaky_relu_backward(dA, cache):
    """
    The backward propagation for a single leaky RELU unit.
    Arguments:
    dA - post-activation gradient
    cache - 'Z' where we store for computing backward propagation efficiently
    Returns:
    dZ - Gradient of the cost with respect to Z
    """
    Z = cache
    # just converting dz to a correct object.
    dZ = np.array(dA, copy=True)
    # When z <= 0, we should set dz to .01 
    dZ[Z <= 0] = .01
    return dZ

Or is there more to it? In this post, How to implement the derivative of Leaky Relu in python?, the answer shows a multiplication happening in the return statement. I'm not sure if I need that or not.

lazylama

1 Answer


You are missing the chain rule. For Leaky ReLU, the activation is

A = Z if Z > 0 else Z * 0.01

Which means:

dA/dZ = 1 if Z > 0 else 0.01

But to get the gradient of the loss L with respect to Z, we apply the chain rule:

dL/dZ = dL/dA * dA/dZ

where dL/dZ is your dZ, and dL/dA is your dA.
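
Concretely, a minimal sketch of the question's function with the chain rule applied (keeping the names dA, cache, and Z from the question) would multiply dA by dA/dZ instead of overwriting it:

import numpy as np

def leaky_relu_backward(dA, cache):
    """
    Backward propagation for a single Leaky ReLU unit.
    Arguments:
    dA - post-activation gradient dL/dA
    cache - 'Z' stored during forward propagation
    Returns:
    dZ - gradient of the cost with respect to Z, i.e. dL/dA * dA/dZ
    """
    Z = cache
    dZ = np.array(dA, copy=True)  # start from dL/dA
    dZ[Z <= 0] *= 0.01            # apply dA/dZ: 1 where Z > 0, 0.01 where Z <= 0
    return dZ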

kwkt
  • Shouldn't `dA` just be `1 if Z > 0 else 0.01`? What is `dz` here? – Countour-Integral Jan 15 '21 at 15:20
  • @Countour-Integral that's dA/dZ, not dL/dA. Let me edit the answer so that it's clearer. – kwkt Jan 15 '21 at 15:22
  • Makes much more sense now. – Countour-Integral Jan 15 '21 at 15:24
  • Maybe this is out of scope, but what is the difference between calculating `dA/dZ` and calculating with respect to loss? Is `dL/dA` different than back propagation? I think that is what's confusing me. – lazylama Jan 15 '21 at 16:09
  • @lazylama your ultimate goal is to calculate dL/dW, where L is the loss and W is the weights. This gradient can be interpreted as the direction in which L changes as W changes. Then, using gradient descent, you can change your weights so that the loss value decreases. dA/dZ gives us nothing interesting in itself; it is just a middle link in the chain rule from dL to dW. – kwkt Jan 15 '21 at 16:21
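
To make that last comment concrete (the thread never shows this layer, so the dense-layer convention Z = W @ A_prev + b below is an assumption), the dZ returned by leaky_relu_backward is what feeds the weight gradient:

import numpy as np

def linear_backward(dZ, A_prev, W):
    # Hypothetical shapes: W is (n_out, n_in), A_prev is (n_in, m), dZ is (n_out, m)
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m                      # dL/dW = dL/dZ * dZ/dW, averaged over m examples
    db = np.sum(dZ, axis=1, keepdims=True) / m  # dL/db
    dA_prev = W.T @ dZ                          # dL/dA passed back to the previous layer
    return dA_prev, dW, db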