
I'm having trouble understanding how to implement backward propagation for Leaky ReLU.

I have read other posts, but I'm still not sure I understand them because the notation isn't defined (I'm not sure what is what).

If I have dA (the gradient with respect to the activation of the current layer) and a cached value Z from forward propagation, is this the correct implementation?

import numpy as np

def leaky_relu_backward(dA, cache):
    """
    The backward propagation for a single leaky RELU unit.
    Arguments:
    dA - post-activation gradient
    cache - 'Z' where we store for computing backward propagation efficiently
    Returns:
    dZ - Gradient of the cost with respect to Z
    """
    Z = cache
    # just converting dz to a correct object.
    dZ = np.array(dA, copy=True)
    # When z <= 0, we should set dz to .01 
    dZ[Z <= 0] = .01
    return dZ

Or is there more to it? In this post, How to implement the derivative of Leaky Relu in python?, the answer shows a multiplication happening in the return statement. I'm not sure if I need that or not.

lazylama

1 Answer


You are missing the chain rule. For Leaky ReLU, the activation is

A = Z if Z > 0 else Z * 0.01

Which means:

dA/dZ = 1 if Z > 0 else 0.01

But to get the gradient of the loss L with respect to Z, we apply the chain rule:

dL/dZ = dL/dA * dA/dZ

where dL/dZ is your dZ, and dL/dA is your dA.
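
Concretely, a minimal sketch of the question's function with the chain rule applied (keeping the names dA, cache, and Z from the question) would multiply dA by dA/dZ instead of overwriting it:

import numpy as np

def leaky_relu_backward(dA, cache):
    """
    Backward propagation for a single Leaky ReLU unit.
    Arguments:
    dA - post-activation gradient dL/dA
    cache - 'Z' stored during forward propagation
    Returns:
    dZ - gradient of the cost with respect to Z, i.e. dL/dA * dA/dZ
    """
    Z = cache
    dZ = np.array(dA, copy=True)  # start from dL/dA
    dZ[Z <= 0] *= 0.01            # apply dA/dZ: 1 where Z > 0, 0.01 where Z <= 0
    return dZ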

kwkt
  • Shouldn't `dA` just be `1 if Z > 0 else 0.01`? What is `dz` here? – Countour-Integral Jan 15 '21 at 15:20
  • @Countour-Integral that's dA/dZ, not dL/dA. Let me edit the answer so that it's clearer. – kwkt Jan 15 '21 at 15:22
  • Makes much more sense now. – Countour-Integral Jan 15 '21 at 15:24
  • Maybe this is out of scope, but what is the difference between calculating `dA/dZ` and calculating with respect to loss? Is `dL/dA` different than back propagation? I think that is what's confusing me. – lazylama Jan 15 '21 at 16:09
  • @lazylama your ultimate goal is to calculate dL/dW, where L is the loss and W is the weights. This gradient can be interpreted as the direction in which L changes as W changes. Then, using gradient descent, you can change your weights so that the loss value decreases. dA/dZ gives us nothing interesting in itself; it is just a middle link in the chain rule from dL to dW. – kwkt Jan 15 '21 at 16:21
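
To make that last comment concrete (the thread never shows this layer, so the dense-layer convention Z = W @ A_prev + b below is an assumption), the dZ returned by leaky_relu_backward is what feeds the weight gradient:

import numpy as np

def linear_backward(dZ, A_prev, W):
    # Hypothetical shapes: W is (n_out, n_in), A_prev is (n_in, m), dZ is (n_out, m)
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m                      # dL/dW = dL/dZ * dZ/dW, averaged over m examples
    db = np.sum(dZ, axis=1, keepdims=True) / m  # dL/db
    dA_prev = W.T @ dZ                          # dL/dA passed back to the previous layer
    return dA_prev, dW, db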