
I've been bashing my head against this brick wall for what seems like an eternity, and I just can't seem to wrap my head around it. I'm trying to implement an autoencoder using only numpy and matrix multiplication. No theano or keras tricks allowed.

I'll describe the problem and all its details. It is a bit complex at first since there are a lot of variables, but it really is quite straightforward.

What we know

1) X is an m by n matrix containing our inputs. Each input is a row of this matrix: an n dimensional row vector, and we have m of them.

2) The number of neurons in our (single) hidden layer, which is k.

3) The activation function of our neurons (sigmoid, will be denoted as g(x)) and its derivative g'(x)

What we don't know and want to find

Overall our goal is to find 6 matrices: w1 which is n by k, b1 which is m by k, w2 which is k by n, b2 which is m by n, w3 which is n by n and b3 which is m by n.

They are initialized randomly and we find the best solution using gradient descent.

The process

The entire process looks like this (diagram omitted; in short: X → z1 → h1 → z2 → h2 → Xhat):

First we compute z1 = Xw1+b1. It is m by k and is the input to our hidden layer. We then compute h1 = g(z1), which simply applies the sigmoid function to every element of z1. Naturally, it is also m by k and is the output of our hidden layer.

We then compute z2 = h1w2+b2 which is m by n and is the input to the output layer of our neural network. Then we compute h2 = g(z2) which again is naturally also m by n and is the output of our neural network.

Finally, we take this output and apply one more linear transformation to it: Xhat = h2w3+b3, which is also m by n and is our final result.

Where I am stuck

The cost function I want to minimize is the mean squared error. I already implemented it in numpy:

def cost(x, xhat):
    return (1.0/(2 * m)) * np.trace(np.dot(x-xhat,(x-xhat).T))
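
(Side note: the trace form above is just the sum of all squared entries, so this equivalent form gives the same value and may be easier to read:)

def cost_direct(x, xhat): #same value as cost(x, xhat), written without the trace
    return np.sum((x - xhat)**2) / (2.0 * m)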

The problem is finding the derivatives of cost with respect to w1,b1,w2,b2,w3,b3. Let's call the cost S.

After deriving the gradients by hand and checking them numerically (a sketch of the kind of numerical check I mean follows the list below), I have established the following facts:

1) dSdxhat = (1.0/m) * (xhat-x)

2) dSdw3 = np.dot(h2.T,dSdxhat)

3) dSdb3 = dSdxhat

4) dSdh2 = np.dot(dSdxhat, w3.T)
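
By "checking numerically" I mean an ordinary finite-difference comparison, roughly like the sketch below (numerical_grad is just an illustrative helper, not my actual code, and it assumes numpy is imported as np):

def numerical_grad(f, w, eps=1e-5):
    #Finite-difference estimate of dS/dw, one entry at a time.
    #f is a zero-argument function that reruns the forward pass with the current
    #parameters and returns the cost.
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old #restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

I compare the output of this against my analytic formulas, e.g. against dSdw3 for the matrix w3.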

But I can't for the life of me figure out dSdz2. It's a brick wall.

From the chain rule, it should be that dSdz2 = dSdh2 * dh2dz2, but the dimensions don't match.

What is the formula to compute the derivative of S with respect to z2?

Edit - This is my code for the entire feed forward operation of the autoencoder.

import numpy as np

def g(x): #sigmoid activation functions
    return 1/(1+np.exp(-x)) #same shape as x!

def gGradient(x): #gradient of sigmoid
    return g(x)*(1-g(x)) #same shape as x!

def cost(x, xhat): #mean squared error between x the data and xhat the output of the machine
    return (1.0/(2 * m)) * np.trace(np.dot(x-xhat,(x-xhat).T))

#Just small random numbers so we can test that it's working small scale
m = 5 #num of examples
n = 2 #num of features in each example
k = 2 #num of neurons in the hidden layer of the autoencoder
x = np.random.rand(m, n) #the data, shape (m, n)

w1 = np.random.rand(n, k) #weights from input layer to hidden layer, shape (n, k)
b1 = np.random.rand(m, k) #bias term from input layer to hidden layer (m, k)
z1 = np.dot(x,w1)+b1 #output of the input layer, shape (m, k)
h1 = g(z1) #input of hidden layer, shape (m, k)

w2 = np.random.rand(k, n) #weights from hidden layer to output layer of the autoencoder, shape (k, n)
b2 = np.random.rand(m, n) #bias term from hidden layer to output layer of autoencoder, shape (m, n)
z2 = np.dot(h1, w2)+b2 #output of the hidden layer, shape (m, n)
h2 = g(z2) #Output of the entire autoencoder. The output layer of the autoencoder. shape (m, n)

w3 = np.random.rand(n, n) #weights from output layer of autoencoder to entire output of the machine, shape (n, n)
b3 = np.random.rand(m, n) #bias term from output layer of autoencoder to entire output of the machine, shape (m, n)
xhat = np.dot(h2, w3)+b3 #the output of the machine, which hopefully resembles the original data x, shape (m, n)
– Oria Gruber
  • Are you sure that your dimensions are not aligned simply because you added in the bias units to your list of h2 units? The derivative seems fine to me. – Paul Brodersen Oct 02 '16 at 15:31
  • Well the difference in shapes is larger than 1, so it can't be the bias term. And I also take the bias term into consideration in my derivatives. – Oria Gruber Oct 06 '16 at 23:33
  • Speaking of your bias terms -- normally, you would apply a constant bias (and learn the value of that), and not a different one on each iteration (i.e. they should have shape (k,) and (n,)). There isn't much you can generalise when it changes for each input. Something else that confuses me is that you clearly have 2 hidden layers but say that you only have one. I think it would be helpful if you provided the code for the full implementation - then we can see what other shenanigans you are doing. – Paul Brodersen Oct 07 '16 at 10:26
  • Ah I'll explain - it's just like the diagram says. I have the original input layer (x), I use some projection and sigmoid to go to the hidden layer (h1), then I do it again to go to the output layer of the autoencoder (h2), and then i take the output of the autoencoder and apply some linear transformation on it to get the output of the entire machine (xhat). so h2 is the final layer of the autoencoder, but xhat is just linearly transformed h2, but we need to learn that transform as well.I will share my code. – Oria Gruber Oct 07 '16 at 13:11
  • Problem no. 1: keep your bias terms constant. – Paul Brodersen Oct 07 '16 at 14:15
  • Also, as its the backward pass that is not working, we definitely need to see that, too. – Paul Brodersen Oct 07 '16 at 14:16
  • Can you please explain why keeping the bias term constant is a good idea? What I'm doing is essentially the same as adding a column of 1's in each layer (which is the go-to method). Also it seems strange that the bias should be randomly chosen rather than learnt. Regarding the backprop - I haven't implemented it yet since I am missing the derivatives; there is nothing to implement yet. – Oria Gruber Oct 07 '16 at 14:24
  • Well, at the moment you are not adding ones, you are adding random numbers (good), which change on each iteration (bad) -- you are just adding noise (and a lot in comparison with your inputs), not a bias. – Paul Brodersen Oct 07 '16 at 14:46
  • Also, it is generally a good idea to a) de-mean your data, and b) have equally many positive and negative weights initially (in essence, rand -> randn for all initialisations). – Paul Brodersen Oct 07 '16 at 14:48

1 Answer


OK, here's a suggestion. In the vector case, if you have x as a vector of length n, then g(x) is also a vector of length n. However, g'(x) is not a vector, it's the Jacobian matrix, and will be of size n X n. Similarly, in the minibatch case, where X is a matrix of size m X n, g(X) is m X n but g'(X) is n X n. Try:

def gGradient(x): #gradient of sigmoid
    return np.dot(g(x).T, 1 - g(x))

@Paul is right that the bias terms should be vectors, not matrices. You should have:

b1 = np.random.rand(k) #bias term from input layer to hidden layer (k,)
b2 = np.random.rand(n) #bias term from hidden layer to output layer of autoencoder, shape (n,)
b3 = np.random.rand(n) #bias term from output layer of autoencoder to entire output of the machine, shape (n,)

Numpy's broadcasting means that you don't have to change your calculation of xhat.
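
For example, a quick illustration of the broadcasting rule (not part of your network, just dummy shapes):

A = np.random.rand(5, 3) #stands in for np.dot(h1, w2), shape (m, n)
b = np.random.rand(3)    #stands in for b2, shape (n,)
print((A + b).shape)     #(5, 3) -- b is added to every row of A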

Then (I think!) you can compute the derivatives like this:

dSdxhat = (1/float(m)) * (xhat-x)
dSdw3 = np.dot(h2.T,dSdxhat)
dSdb3 = dSdxhat.mean(axis=0)
dSdh2 = np.dot(dSdxhat, w3.T)
dSdz2 = np.dot(dSdh2, gGradient(z2))
dSdb2 = dSdz2.mean(axis=0)
dSdw2 = np.dot(h1.T,dSdz2)
dSdh1 = np.dot(dSdz2, w2.T)
dSdz1 = np.dot(dSdh1, gGradient(z1))
dSdb1 = dSdz1.mean(axis=0)
dSdw1 = np.dot(x.T,dSdz1)

Does this work for you?

Edit

I've decided that I'm not at all sure that gGradient is supposed to be a matrix. How about:

dSdxhat = (xhat-x) / m
dSdw3 = np.dot(h2.T,dSdxhat)
dSdb3 = dSdxhat.sum(axis=0)
dSdh2 = np.dot(dSdxhat, w3.T)
dSdz2 = h2 * (1-h2) * dSdh2
dSdb2 = dSdz2.sum(axis=0)
dSdw2 = np.dot(h1.T,dSdz2)
dSdh1 = np.dot(dSdz2, w2.T)
dSdz1 = h1 * (1-h1) * dSdh1
dSdb1 = dSdz1.sum(axis=0)
dSdw1 = np.dot(x.T,dSdz1)
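
(Not something you asked, but for completeness: once the gradients check out numerically, a plain gradient descent step is just the following, repeated inside a loop that reruns the forward pass each iteration. The learning rate is arbitrary here.)

lr = 0.01 #learning rate, chosen arbitrarily for illustration
w1 -= lr * dSdw1
b1 -= lr * dSdb1
w2 -= lr * dSdw2
b2 -= lr * dSdb2
w3 -= lr * dSdw3
b3 -= lr * dSdb3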
– wildwilhelm
  • I'll try but I think there's a problem. Look at the line z1=np.dot(x,w1)+b1. The dot product of x and w1 is m by k, but b1 is (according to you) k by 1. Adding them is impossible. Perhaps you meant that I should replicate this k by 1 vector m times? – Oria Gruber Oct 07 '16 at 16:44
  • So, [broadcasting](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) means that numpy will automatically replicate the vector for you. – wildwilhelm Oct 07 '16 at 16:51
  • Oh I see, thank you very much, I didn't know that. I will be sure to check it and post the results soon! – Oria Gruber Oct 07 '16 at 16:56
  • That doesn't seem to be correct. The numerical derivative of z2 is not close to dSdz2 as you calculate it. – Oria Gruber Oct 07 '16 at 17:09
  • Ah, I figure the gGradient is off by a factor of `m`. Is it better if you do `dSdz2 = np.dot(dSdh2, gGradient(z2)) / m`? – wildwilhelm Oct 07 '16 at 17:14
  • Afraid not. Also unrelated since I fixed it myself, but notice that gGradient should actually be np.dot(g(x).T, 1-g(x)). But that's not our issue since I fixed it. In one of my earlier versions, I managed to get a correct derivative for z2 but only for the case of a single vector rather than batch. Now that I replicate the biases rather than create a random matrix, maybe it will work now? – Oria Gruber Oct 07 '16 at 17:16
  • Ah, yes, you're right about `gGradient`. Stupid typo. – wildwilhelm Oct 07 '16 at 17:19
  • Ok so I think I made a breakthrough: Your code works for the derivative to z2, but the way you compute the gradient needs to be different. the gGradient should be an n by n diagonal matrix. I checked it now and it works! But only for z2. derivative for b2 seems to be incorrect. – Oria Gruber Oct 07 '16 at 18:06
  • Derivatives for bias terms are incorrect. It seems that when you replicated them you also affected the derivatives. The derivatives wrt b3, b2 and b1 are incorrect; z2 and w2 are fine. – Oria Gruber Oct 07 '16 at 19:02
  • @OriaGruber so I suppose that the bias derivative terms should be the sum across their corresponding dSdz term, in the same way that the weight derivatives are summed (`np.dot`) up across the `m` inputs. My first attempt used the mean, which I guess was wrong. The principle I'm trying to follow here is that none of your network parameters (weights, biases, nor any of their update terms) should be dependent on `m`: `m` is just the number of data points you're processing "at once" during training, and should not dictate anything about the shape of your network parameters. – wildwilhelm Oct 08 '16 at 12:42
  • I did manage to solve the problem thanks to your input and it is now working correctly. The key insight was to get rid of the bias terms b1 b2 b3 and instead add a column of ones to the data. I have accepted your answer due to your effort in helping me solve it. Thank you! – Oria Gruber Oct 08 '16 at 13:12