2

What is the impact of the fact that the ReLU activation function does not have a derivative at 0?

The question How to implement the ReLU function in Numpy implements ReLU as the element-wise maximum of 0 and the matrix/vector elements.
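
In numpy that is essentially a one-liner, something like:

import numpy as np

def relu(x):
    # element-wise maximum of 0 and the input; works for vectors and matrices alike
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]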

Does this mean that for gradient descent we do not take the derivative of the ReLU function?

Update:

From Neural network backpropagation with RELU, this text aids understanding:

The ReLU function is defined as: For x > 0 the output is x, i.e. f(x) = max(0,x)

So for the derivative f'(x) it's actually:

if x < 0, the output is 0; if x > 0, the output is 1.

The derivative f'(0) is not defined. So it's usually set to 0, or you modify the activation function to be f(x) = max(e, x) for a small e.

Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer unit, except that instead of tanh(x), sigmoid(x), or whatever activation you would otherwise use, you use f(x) = max(0,x).

If you have written code for a working multilayer network with sigmoid activation, it's literally a one-line change. Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
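
As a minimal sketch of what that one-line change amounts to (the function names here are illustrative, not from any particular library), only the activation and its derivative are swapped:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # sub-gradient: 1 where x > 0, 0 elsewhere (including x == 0)
    return (x > 0).astype(float)

# Forward pass:  replace sigmoid(h) with relu(h).
# Backward pass: replace sigmoid_grad(h) with relu_grad(h).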

But this still leaves some confusion: backpropagation of the cost function typically requires the derivative of the activation function, so for ReLU how does this impact the gradient of the cost function?

blue-sky
  • It means you might lose all the guarantees with regard to GD (wiki calls it *ill-defined*). – sascha Nov 30 '17 at 16:15
  • The text in the update does not address the question: (1) this is a sub-gradient. (2) the sub-gradient might be used in SGD, but not in plain GD. – sascha Nov 30 '17 at 19:15

1 Answer

2

The standard answer is that the input to ReLU is rarely exactly zero (see here, for example), so it doesn't make any significant difference.

Specifically, for ReLU to get a zero input, the dot product of one entire row of the input to a layer with one entire column of the layer's weight matrix would have to be exactly zero. Even if you have an all-zero input sample, there should still be a bias term in the last position, so I don't really see this ever happening.
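
You can check this empirically; with random (continuous) inputs and weights, a quick illustrative test essentially never produces an exactly-zero pre-activation:

import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 10)   # random inputs
w = np.random.randn(10, 50)     # random weights
h = x.dot(w)                    # pre-activations that would be fed into ReLU
print(np.sum(h == 0.0))         # count of exact zeros out of 50,000 entries -- expect 0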

However, if you want to test for yourself, try implementing the derivative at zero as 0, 0.5, and 1 and see if anything changes.
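
One way to set up that experiment is to parameterize the value used at exactly zero; a hypothetical helper might look like this:

def relu_subgradient(h, at_zero=0.0):
    # 1 where h > 0, 0 where h < 0, and `at_zero` (0, 0.5 or 1) where h == 0
    g = (h > 0).astype(float)
    g[h == 0] = at_zero
    return g

In the backward pass you would then multiply the incoming gradient element-wise by relu_subgradient(h, at_zero).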

The PyTorch docs include a simple neural network example written in plain numpy, with one hidden layer and ReLU activation. I have reproduced it below with a fixed random seed and three options for setting the behavior of the ReLU gradient at 0. I have also added a bias term.

import numpy as np

np.random.seed(1)

N, D_in, H, D_out = 4, 2, 30, 1

# Create random input and output data
x = np.random.randn(N, D_in)
x = np.c_[x, np.ones(x.shape[0])]  # append a column of ones as the bias term
y = np.random.randn(N, D_out)

# Randomly initialize weights (D_in + 1 to account for the bias column)
w1 = np.random.randn(D_in + 1, H)
w2 = np.random.randn(H, D_out)

learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # use ReLU as the activation function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum() # loss function
    loss_col.append(loss)
    print(t, loss, y_pred)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y) # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)  # the second layer's error
    grad_h = grad_h_relu.copy()        
    grad_h[h < 0] = 0  # variant 1: gradient at zero treated as 1
    # grad_h[h <= 0] = 0  # variant 2: gradient at zero treated as 0
    # grad_h[h < 0] = 0; grad_h[h == 0] *= 0.5  # variant 3: gradient at zero treated as 0.5
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
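
With this setup the pre-activations h are essentially never exactly zero, so switching between the three commented variants of the gradient at zero should leave the loss curve in loss_col virtually unchanged, which is the point of the experiment.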
Imran