I'm implementing a simple neural network following Andrew Ng's tutorial on Coursera. I performed gradient checking to verify the correctness of the gradients computed by my backprop algorithm, and my computed gradients match the ones obtained by that other method, so I'm quite confident that my implementation is OK. However, I'm getting really bad results (45% accuracy) identifying digits.
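For reference, the gradient check I'm comparing against is essentially the standard two-sided finite difference, something along these lines (a simplified sketch; cost_fn and numerical_gradient are just illustrative names, not my exact gradient_checking code):

import numpy as np

def numerical_gradient(cost_fn, theta, eps=1e-4):
    # Two-sided finite-difference estimate of d cost / d theta.
    # cost_fn: callable taking the (possibly perturbed) parameter array,
    #          returning the scalar cost over the whole training set.
    # theta:   one weight or bias array to perturb element by element.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        original = theta[idx]
        theta[idx] = original + eps
        cost_plus = cost_fn(theta)
        theta[idx] = original - eps
        cost_minus = cost_fn(theta)
        theta[idx] = original  # restore the parameter
        grad[idx] = (cost_plus - cost_minus) / (2 * eps)
    return grad

I run this over every theta and bias array and compare element-wise against the backprop gradients, which is where the ~1.0e-07 / ~1.0e-10 differences printed in the code below come from.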
My doubt is this: if I remove the sigmoid derivative when computing the deltas of the internal layers, I get an accuracy of 90%. I don't understand why my original results are so bad, and why removing that factor improves them so much. Also, when I remove the sigmoid derivative, the computed gradients differ enormously from the output of gradient checking (obviously, since I'm no longer computing the derivatives correctly).
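To be explicit about what I mean, with sigmoid activations the hidden-layer delta from the tutorial is

\delta^{(l)} = \left(\Theta^{(l)}\right)^{T} \delta^{(l+1)} \odot a^{(l)} \odot \left(1 - a^{(l)}\right)

where the factor a^{(l)} \odot (1 - a^{(l)}) is the sigmoid derivative g'(z^{(l)}) written in terms of the activations. The variant that (surprisingly) works much better for me just drops that factor:

\delta^{(l)} = \left(\Theta^{(l)}\right)^{T} \delta^{(l+1)}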
The relevant part of the tutorial that I'm following is this one:
And my Backprop implementation is:
def backpropagation(self, X, y):
    n_elements = X.shape[0]
    # Accumulators for the weight and bias gradients
    DELTA_theta = [np.zeros(t.shape) for t in self.theta]
    DELTA_bias = [np.zeros(b.shape) for b in self.bias]
    for i in range(0, n_elements):
        A = self.forwardpropagation(X[i])  # activations of every layer
        delta = np.copy(A[-1])             # output-layer delta: a - y (y[i] is the class index)
        delta[y[i]] -= 1
        for l in reversed(range(0, self.n_layers - 1)):
            DELTA_theta[l] += np.outer(A[l], delta)
            DELTA_bias[l] += delta
            if l != 0:
                # Hidden-layer delta, including the sigmoid derivative A[l] * (1 - A[l])
                delta = np.dot(self.theta[l], delta) * A[l] * (1 - A[l])
                # delta = np.dot(self.theta[l], delta)  # THIS GIVES MUCH BETTER RESULTS
    # Add regularization (weights only) and average over the training examples
    gradient_theta = [d + self.regularization * self.theta[i] for i, d in enumerate(DELTA_theta)]
    gradient_theta = [g / n_elements for g in gradient_theta]
    gradient_bias = [d / n_elements for d in DELTA_bias]
    # Compare against the numerically estimated gradients
    estimated_gradient_theta, estimated_gradient_bias = self.gradient_checking(X, y)
    diff_theta = [np.amax(g - e) for g, e in zip(gradient_theta, estimated_gradient_theta)]
    diff_bias = [np.amax(g - e) for g, e in zip(gradient_bias, estimated_gradient_bias)]
    print(max(diff_theta))  # Around 1.0e-07
    print(max(diff_bias))   # Around 1.0e-10
    return gradient_theta + gradient_bias
(Note that my weights are stored in self.theta, and that each matrix has different dimensions than in the tutorial, since mine are transposed by default.)
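To make that shape convention concrete, here is a tiny illustrative snippet (not my actual code; the forward step is my assumption of how the shapes line up, using sigmoid activations as in the tutorial):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta[l] is stored with shape (units in layer l, units in layer l+1),
# i.e. the transpose of the tutorial's Theta^(l)
n_l, n_next = 4, 3
theta_l = np.random.randn(n_l, n_next)
bias_l = np.random.randn(n_next)
a_l = np.random.rand(n_l)

a_next = sigmoid(theta_l.T.dot(a_l) + bias_l)             # forward step with the transposed layout
delta_next = np.random.randn(n_next)                      # delta coming back from layer l+1
delta_l = np.dot(theta_l, delta_next) * a_l * (1 - a_l)   # matches my backprop line, no transpose needed
grad_theta_l = np.outer(a_l, delta_next)                  # same shape as theta_l, (n_l, n_next)

So the missing transpose in np.dot(self.theta[l], delta) is intentional; it is already accounted for by the storage layout.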
Any idea why this is happening? I've spent so many hours on this... Thanks!