
I'm implementing a simple neural network following Andrew Ng's tutorial on Coursera. I performed gradient checking to verify the correctness of the gradients computed by my backprop algorithm, and my computed gradients match those obtained by that method, so I'm fairly confident my implementation is correct. However, I'm getting really bad results (45% accuracy) identifying digits.

My doubt is this: if I remove the sigmoid derivative when I compute the deltas of the internal layers, I get an accuracy of 90%. I don't understand why my original results are so bad, and why this change improves them so much. Also, when I remove the sigmoid derivative, the computed gradients differ enormously from the output of gradient checking (obviously, since I'm no longer computing the actual derivatives).
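(To be clear, by "the sigmoid derivative" I mean the A[l] * (1 - A[l]) factor, i.e. the derivative of the sigmoid written in terms of its own output, which appears in the hidden-layer delta in the code below. The helper names here are just for illustration, not from my actual code:)

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative_from_activation(a):
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), with a = sigmoid(z)
    return a * (1 - a)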

The relevant part of the tutorial that I'm following is this one:

[Image: backpropagation slide from the tutorial]
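(In case the image doesn't load: as far as I remember, that part of the course gives the standard backpropagation recursion, writing element-wise multiplication as \odot:

\delta^{(L)} = a^{(L)} - y
\delta^{(l)} = \bigl(\Theta^{(l)}\bigr)^{T} \delta^{(l+1)} \odot a^{(l)} \odot \bigl(1 - a^{(l)}\bigr)
\Delta^{(l)} \mathrel{+}= \delta^{(l+1)} \bigl(a^{(l)}\bigr)^{T}

which is what I've tried to implement.)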

And my Backprop implementation is:

def backpropagation(self, X, y):
    n_elements = X.shape[0]

    # Accumulators for the weight and bias gradients
    DELTA_theta = [np.zeros(t.shape) for t in self.theta]
    DELTA_bias = [np.zeros(b.shape) for b in self.bias]

    for i in range(0, n_elements):
        # Forward pass: A[l] holds the activations of layer l
        A = self.forwardpropagation(X[i])

        # Output-layer delta: a - y (y[i] is the index of the correct class)
        delta = np.copy(A[-1])
        delta[y[i]] -= 1

        for l in reversed(range(0, self.n_layers - 1)):
            # Accumulate this example's gradient contribution
            DELTA_theta[l] += np.outer(A[l], delta)
            DELTA_bias[l] += delta

            if l != 0:
                # Propagate the delta to the previous layer, applying the sigmoid derivative
                delta = np.dot(self.theta[l], delta) * A[l] * (1 - A[l])
                # delta = np.dot(self.theta[l], delta)  # THIS GIVES MUCH BETTER RESULTS

    # Average over the batch and add regularization (weights only)
    gradient_theta = [d + self.regularization * self.theta[i] for i, d in enumerate(DELTA_theta)]
    gradient_theta = [g / n_elements for g in gradient_theta]
    gradient_bias = [d / n_elements for d in DELTA_bias]

    # Compare the analytic gradients against the numerical estimate
    estimated_gradient_theta, estimated_gradient_bias = self.gradient_checking(X, y)
    diff_theta = [np.amax(g - e) for g, e in zip(gradient_theta, estimated_gradient_theta)]
    diff_bias = [np.amax(g - e) for g, e in zip(gradient_bias, estimated_gradient_bias)]

    print(max(diff_theta))  # Around 1.0e-07
    print(max(diff_bias))   # Around 1.0e-10

    return gradient_theta + gradient_bias

(Note that my weights are stored in self.theta, and that the matrices have different dimensions than in the tutorial because mine are transposed by default.)
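(For context, gradient_checking follows the numerical approximation from the course; roughly, the idea is the following, shown here as a simplified sketch for a single weight matrix rather than my exact code:)

import numpy as np

def numerical_gradient(cost_fn, theta, epsilon=1e-4):
    # cost_fn(theta) must return the scalar cost for this weight matrix
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        original = theta[idx]
        theta[idx] = original + epsilon
        cost_plus = cost_fn(theta)
        theta[idx] = original - epsilon
        cost_minus = cost_fn(theta)
        theta[idx] = original  # restore the original value
        # Centred finite difference
        grad[idx] = (cost_plus - cost_minus) / (2 * epsilon)
    return grad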

Any idea of why this is happening? I've spent so many hours with this... Thanks!

Gonzalo Solera
  • The code in [this question](https://stackoverflow.com/questions/44613838/how-to-use-neural-nets-to-recognize-handwritten-digits) should help, it implements the code from the tutorial, IIRC. – cs95 Jan 22 '18 at 13:26
  • Thanks @cᴏʟᴅsᴘᴇᴇᴅ for the link, but I don't think this is a duplicate of that question. If it's a duplicate of another one, could you send me that link? I've searched a lot and didn't find a similar one. (Note that I already have my implementation; I want to understand the behaviour when removing the sigmoid derivative.) – Gonzalo Solera Jan 22 '18 at 13:33
  • Also, in the solution in your link, I think they are ignoring the sigmoid derivative, but I still don't understand why this works better when it differs so much from the gradient checking output. – Gonzalo Solera Jan 22 '18 at 13:34
  • It really depends on the data. You can't tell what activation works best. There isn't any formal activation theory or one-fits-all. – cs95 Jan 22 '18 at 13:37
  • I understand that a different activation function would generate different results, but I'm not changing the activation function (it's always the sigmoid). What I'm doing is removing the derivative of the sigmoid function when I compute the gradient (so it shouldn't be computing the true gradient anymore, yet it obtains better results). – Gonzalo Solera Jan 22 '18 at 13:43

0 Answers