
Do you have any idea why this network doesn't want to learn? The idea is that it uses ReLU as the activation function in the earlier layers and sigmoid as the activation function in the last layer. The network learned fine when I used only sigmoid. To verify the network I used MNIST.

import numpy as np

def sigmoid(z):
    # logistic activation, used in the output layer
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid
    return sigmoid(z) * (1 - sigmoid(z))

def RELU(z):
    # elementwise max(z, 0)
    return z * (z > 0)

def RELU_Prime(z):
    # derivative of ReLU: 1 for z > 0, 0 otherwise (the boolean array works in NumPy arithmetic)
    return (z > 0)

    # x - training data in mnist for example (1,784) vector
    # y - training label in mnist for example (1,10) vector
    # nabla is gradient for the current x and y 
    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        index = 0
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            # the last layer uses sigmoid
            if index == len(self.weights) - 1:
                activation = sigmoid(z)
            # earlier layers use ReLU
            else:
                activation = RELU(z)

            activations.append(activation)
            index += 1
        # backward pass
        delta = self.cost_derivative(activations[-1], y) *\
             sigmoid_prime(zs[-1])

        nabla_b[-1] = delta

        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = RELU_Prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

--------------- Edit -----------------------------

    def cost_derivative(self, output_activations, y):
        # derivative of the quadratic cost with respect to the output activations
        return (output_activations - y)

--------------- Edit 2 -----------------------------

        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]

eta > 0 (eta is the learning rate)
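For context, in Nielsen's Network 1 this update sits at the end of `update_mini_batch`; a rough sketch of that surrounding method (not the poster's exact code) looks like this:

    def update_mini_batch(self, mini_batch, eta):
        # accumulate the gradient over the mini-batch, then take one gradient descent step
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # subtract the gradient, scaled by eta / batch size, i.e. move downhill
        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]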

  • ReLU has derivative 0 on the negative side. So if the input to ReLU is negative, nothing is backpropagated to ReLU's input. You could try with a slightly modified version: `def RELU(z): return z*(0.1 + 0.9 * (z>0))`; `def RELU_Prime(z): return 0.1 + 0.9 * (z>0)` (a sketch of this variant follows after these comments). – Stef Nov 11 '20 at 17:13
  • Can you post `.cost_derivative`? – Marat Nov 11 '20 at 17:18
  • It doesn't seem you're changing the weights anywhere. – Marat Nov 11 '20 at 17:23
  • @Stef I did as you said but unfortunately it still didn't work. – XXXXXXXX Nov 11 '20 at 17:24
  • @Marat I have posted only the part with backprop. I use another function to update the weights. It's basically Nielsen's Network 1 code with ReLU added. – XXXXXXXX Nov 11 '20 at 17:26
  • I asked because perhaps the most common mistake in implementing backpropagation is not reversing the gradient sign when updating the weights. If this is existing code, maybe it expects the nablas to already include that reversal. – Marat Nov 11 '20 at 17:47
  • @Marat I edited the post and added the subtraction of the gradient. – XXXXXXXX Nov 11 '20 at 17:49
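Following Stef's comment, a leaky-ReLU-style replacement for the two functions could look like this (a sketch using the 0.1 slope suggested in the comment; it assumes `z` is a NumPy array, as in the original code):

def leaky_RELU(z):
    # slope 1 for z > 0 and slope 0.1 for z <= 0, so some gradient always flows back
    return z * (0.1 + 0.9 * (z > 0))

def leaky_RELU_Prime(z):
    # derivative of leaky_RELU: 1 where z > 0, 0.1 elsewhere
    return 0.1 + 0.9 * (z > 0)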

1 Answer


For those who find this in the future: the answer to this problem is simple but hidden :). It turns out the weight initialization was wrong. To make it work you have to use Xavier initialization and multiply it by 2 (which is essentially He initialization, the usual choice for ReLU layers).
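In Nielsen-style code, that initialization might look roughly like the following sketch (the method name and `self.sizes` follow Nielsen's Network class and are assumptions, not the poster's exact fix):

    def default_weight_initializer(self):
        # He-style initialization for ReLU layers: Gaussian weights with the
        # Xavier variance 1/fan_in doubled to 2/fan_in, i.e. scale sqrt(2 / fan_in)
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x) * np.sqrt(2.0 / x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]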
