
Do you have any idea why this network doesn't want to learn? The idea is that it uses ReLU as the activation function in the earlier layers and sigmoid as the activation function in the last layer. The network learned fine when I used only sigmoid. To verify the network I used MNIST.

import numpy as np

def sigmoid(z):
    # logistic activation, used in the output layer
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid
    return sigmoid(z) * (1 - sigmoid(z))

def RELU(z):
    # elementwise max(z, 0)
    return z * (z > 0)

def RELU_Prime(z):
    # derivative of ReLU: 1 for z > 0, 0 otherwise (the boolean array works in NumPy arithmetic)
    return (z > 0)

    # x - training data in mnist for example (1,784) vector
    # y - training label in mnist for example (1,10) vector
    # nabla is gradient for the current x and y 
    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        index = 0
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            # the last layer uses sigmoid
            if index == len(self.weights) - 1:
                activation = sigmoid(z)
            # earlier layers use ReLU
            else:
                activation = RELU(z)

            activations.append(activation)
            index += 1
        # backward pass
        delta = self.cost_derivative(activations[-1], y) *\
             sigmoid_prime(zs[-1])

        nabla_b[-1] = delta

        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = RELU_Prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

--------------- Edit -----------------------------

    def cost_derivative(self, output_activations, y):
        # derivative of the quadratic cost with respect to the output activations
        return (output_activations - y)

--------------- Edit 2 -----------------------------

        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]

eta > 0 (eta is the learning rate)
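For context, in Nielsen's Network 1 this update sits at the end of `update_mini_batch`; a rough sketch of that surrounding method (not the poster's exact code) looks like this:

    def update_mini_batch(self, mini_batch, eta):
        # accumulate the gradient over the mini-batch, then take one gradient descent step
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # subtract the gradient, scaled by eta / batch size, i.e. move downhill
        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]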

  • ReLU has derivative 0 on the negative side. So if the input to ReLU is negative, nothing is backpropagated to ReLU's input. You could try with a slightly modified version: `def RELU(z): return z*(0.1 + 0.9 * (z>0))`; `def RELU_Prime(z): return 0.1 + 0.9 * (z>0)` (a sketch of this variant follows after these comments). – Stef Nov 11 '20 at 17:13
  • Can you post `.cost_derivative`? – Marat Nov 11 '20 at 17:18
  • It doesn't seem you're changing the weights anywhere. – Marat Nov 11 '20 at 17:23
  • @Stef I did as you said but unfortunately it still didn't work. – XXXXXXXX Nov 11 '20 at 17:24
  • @Marat I have posted only the part with backprop. I use another function to update the weights. It's basically Nielsen's Network 1 code with ReLU added. – XXXXXXXX Nov 11 '20 at 17:26
  • I asked because perhaps the most common mistake in implementing backpropagation is not reversing the gradient sign when updating the weights. If this is existing code, maybe it expects the nablas to already include that reversal. – Marat Nov 11 '20 at 17:47
  • @Marat I edited the post and added the subtraction of the gradient. – XXXXXXXX Nov 11 '20 at 17:49
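Following Stef's comment, a leaky-ReLU-style replacement for the two functions could look like this (a sketch using the 0.1 slope suggested in the comment; it assumes `z` is a NumPy array, as in the original code):

def leaky_RELU(z):
    # slope 1 for z > 0 and slope 0.1 for z <= 0, so some gradient always flows back
    return z * (0.1 + 0.9 * (z > 0))

def leaky_RELU_Prime(z):
    # derivative of leaky_RELU: 1 where z > 0, 0.1 elsewhere
    return 0.1 + 0.9 * (z > 0)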

1 Answer


For those who find this in the future: the answer to this problem is simple but hidden :). It turns out the weight initialization was wrong. To make it work you have to use Xavier initialization and multiply it by 2 (which is essentially He initialization, the usual choice for ReLU layers).
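In Nielsen-style code, that initialization might look roughly like the following sketch (the method name and `self.sizes` follow Nielsen's Network class and are assumptions, not the poster's exact fix):

    def default_weight_initializer(self):
        # He-style initialization for ReLU layers: Gaussian weights with the
        # Xavier variance 1/fan_in doubled to 2/fan_in, i.e. scale sqrt(2 / fan_in)
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x) * np.sqrt(2.0 / x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]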
