
I've created a program that allows me to create flexible neural networks of any size/length; however, I'm testing it using the simple structure of an XOR setup (feed-forward, sigmoid activation, backpropagation, no batching).

EDIT: The following is a completely new approach to my original question, which didn't supply enough information.

EDIT 2: I now initialize my weights between -2.5 and 2.5, and fixed a problem in my code where I had forgotten some negatives. Now it either converges to 0 for all cases or to 1 for all cases, instead of 0.5.

Everything works exactly the way that I THINK it should, however it is converging toward 0.5, instead of oscillating between outputs of 0 and 1. I've completely gone through and hand-calculated an entire setup of feeding forward / calculating delta errors / backprop, etc., and it matched what I got from the program. I have also tried optimizing it by changing the learning rate/momentum, as well as increasing the complexity of the network (more neurons/layers).

Because of this, I assume that either one of my equations is wrong, or I have some other sort of misunderstanding in my Neural Network. The following is the logic with equations that I follow for each step:

I have an input layer with two inputs and a bias, a hidden with 2 neurons and a bias, and an output with 1 neuron.

  1. Take the input from each of the two input neurons and the bias neuron, then multiply them by their respective weights, and then add them together as the input for each of the two neurons in the hidden layer.
  2. Take the input of each hidden neuron, pass it through the Sigmoid activation function (Reference 1) and use that as the neuron's output.
  3. Take the outputs of each neuron in hidden layer (1 for the bias), multiply them by their respective weights, and add those values to the output neuron's input.
  4. Pass the output neuron's input through the Sigmoid activation function, and use that as the output for the whole network.
  5. Calculate the Delta Error (Reference 2) for the output neuron.
  6. Calculate the Delta Error (Reference 3) for each of the 2 hidden neurons.
  7. Calculate the Gradient (Reference 4) for each weight (starting from the end and working back).
  8. Calculate the Delta Weight (Reference 5) for each weight, and add that to its value.
  9. Start the process over by changing the inputs and expected output (Reference 6).
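The nine steps above can be sketched as a single self-contained training loop for the 2-2-1 network. This is only an illustration with made-up variable names (not the asker's actual classes), and it omits momentum:

```java
import java.util.Random;

// Minimal 2-2-1 feed-forward/backprop sketch of steps 1-9
// (hypothetical names; targets follow the standard XOR truth table).
public class XorSketch {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[][] wIH = new double[2][3]; // 2 inputs + bias -> 2 hidden
        double[] wHO = new double[3];      // 2 hidden + bias -> 1 output
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 3; j++)
                wIH[i][j] = rnd.nextDouble() * 5 - 2.5;
        for (int j = 0; j < 3; j++)
            wHO[j] = rnd.nextDouble() * 5 - 2.5;

        double[][] in = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] target = {0, 1, 1, 0};
        double lr = 0.5;

        for (int epoch = 0; epoch < 20000; epoch++) {
            for (int p = 0; p < 4; p++) {
                // steps 1-2: weighted sums into hidden layer, then sigmoid
                double[] h = new double[2];
                for (int i = 0; i < 2; i++)
                    h[i] = sigmoid(wIH[i][0] * in[p][0]
                                 + wIH[i][1] * in[p][1]
                                 + wIH[i][2]); // bias input is 1
                // steps 3-4: weighted sum into output neuron, then sigmoid
                double out = sigmoid(wHO[0] * h[0] + wHO[1] * h[1] + wHO[2]);
                // step 5: output delta, computed from the activation `out`
                double dOut = (target[p] - out) * out * (1 - out);
                // step 6: hidden deltas
                double[] dH = new double[2];
                for (int i = 0; i < 2; i++)
                    dH[i] = h[i] * (1 - h[i]) * wHO[i] * dOut;
                // steps 7-8: gradients and weight updates (no momentum here)
                for (int i = 0; i < 2; i++)
                    wHO[i] += lr * dOut * h[i];
                wHO[2] += lr * dOut; // bias weight
                for (int i = 0; i < 2; i++) {
                    wIH[i][0] += lr * dH[i] * in[p][0];
                    wIH[i][1] += lr * dH[i] * in[p][1];
                    wIH[i][2] += lr * dH[i]; // bias weight
                }
                // step 9: move on to the next pattern / epoch
            }
        }
        for (int p = 0; p < 4; p++) {
            double[] h = new double[2];
            for (int i = 0; i < 2; i++)
                h[i] = sigmoid(wIH[i][0] * in[p][0]
                             + wIH[i][1] * in[p][1] + wIH[i][2]);
            double out = sigmoid(wHO[0] * h[0] + wHO[1] * h[1] + wHO[2]);
            System.out.println((int) in[p][0] + " XOR " + (int) in[p][1]
                    + " -> " + out);
        }
    }
}
```

Note the sign convention: writing the output delta as (expected - actual) * sigma'(z) and adding the updates is the same as the -(actual - expected) form in Reference 2.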

Here are the specifics of those references to equations/processes (This is probably where my problem is!):

  1. x is the input of the neuron: (1/(1 + Math.pow(Math.E, (-1 * x))))
  2. -1 * (actualOutput - expectedOutput) * (Sigmoid(x) * (1 - Sigmoid(x))) // same Sigmoid used in Reference 1
  3. SigmoidDerivative(Neuron.input)*(The sum of(Neuron.Weights * the deltaError of the neuron they connect to))
  4. ParentNeuron.output * NeuronItConnectsTo.deltaError
  5. learningRate*(weight.gradient) + momentum*(Previous Delta Weight)
  6. I have an ArrayList with the values 0,1,1,0 in it, in that order. It takes the first pair (0,1) and then expects a 1. For the second time through, it takes the second pair (1,1) and expects a 0. It just keeps iterating through the list for each new set. Perhaps training it in this fixed, systematic order causes the problem?
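On the Reference 6 question: cycling the patterns in a fixed order is common for a problem as tiny as XOR, but reshuffling the presentation order each epoch is a cheap safeguard worth trying. A sketch (class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Reshuffle the XOR training patterns before each epoch instead of
// always presenting them in the same fixed order.
public class ShuffleOrder {

    // Each entry is {input1, input2, expectedOutput}.
    static List<int[]> xorPatterns() {
        return Arrays.asList(
                new int[]{0, 0, 0}, new int[]{0, 1, 1},
                new int[]{1, 0, 1}, new int[]{1, 1, 0});
    }

    public static void main(String[] args) {
        List<int[]> patterns = xorPatterns();
        Collections.shuffle(patterns); // do this once per epoch
        for (int[] p : patterns)
            System.out.println(p[0] + " XOR " + p[1] + " = " + p[2]);
    }
}
```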

Like I said before, the reason I don't think it's a code problem is that it matched exactly what I had calculated with paper and pencil (which wouldn't have happened if there were a coding error).

Also, when I initialize my weights the first time, I give them a random double value between 0 and 1. This article suggests that may lead to a problem: Neural Network with backpropogation not converging. Could that be it? I tried the n^(-1/2) rule, but that did not fix it.
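For reference, one reading of the n^(-1/2) rule mentioned is to draw each incoming weight uniformly from [-1/sqrt(n), +1/sqrt(n)], where n is the neuron's fan-in. A sketch (names are illustrative):

```java
import java.util.Random;

// Draw each incoming weight uniformly from [-1/sqrt(n), +1/sqrt(n)],
// where n is the number of inputs feeding the neuron (fan-in).
public class WeightInit {

    static double[] initWeights(int fanIn, Random rnd) {
        double limit = 1.0 / Math.sqrt(fanIn);
        double[] w = new double[fanIn];
        for (int i = 0; i < fanIn; i++)
            w[i] = (rnd.nextDouble() * 2 - 1) * limit;
        return w;
    }

    public static void main(String[] args) {
        // 2 inputs + 1 bias feeding each hidden neuron
        double[] w = initWeights(3, new Random());
        for (double v : w)
            System.out.println(v);
    }
}
```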

If I can be more specific or you want other code let me know, thanks!

Marshall D
    Short summary of your question: I have a bunch of code that I don't show because I am pretty sure it is correct. But something doesn't work. What's wrong? – Henry Sep 16 '16 at 05:06
  • @Henry yes, I apologize, but I don't know where to begin. It's a combination of 3 classes, each with a couple hundred lines of code. However, a neural network converging toward 0.5 is a very specific problem, and I'm hoping somebody has at least a direction for me to begin looking. I'll add my deltaError and gradient code since that isn't too long – Marshall D Sep 16 '16 at 12:38
  • @Henry I just completely redid the question supplying a lot more information and explanations for areas. Hopefully you can re-evaluate? Thanks! – Marshall D Sep 17 '16 at 00:49

1 Answer

This is wrong:

    SigmoidDerivative(Neuron.input) * (the sum of (Neuron.Weights * the deltaError of the neuron they connect to))

The first function below is the sigmoid activation (g); the second is the derivative of the sigmoid activation (gD). Note that gD takes the already-computed activation, not the raw input:

private double g(double z) {
    return 1 / (1 + Math.exp(-z)); // sigmoid activation
}

private double gD(double gZ) {
    return gZ * (1 - gZ); // derivative, in terms of the activation gZ = g(z)
}
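To make the asymmetry concrete: g takes the raw input z, while gD takes the already-activated value g(z). A quick numeric check of that identity (a sketch; method names copied from the snippet above):

```java
// Check that gD(g(z)) equals the analytic sigmoid derivative
// e^-z / (1 + e^-z)^2; passing the raw z into gD gives a different value.
public class SigmoidCheck {
    static double g(double z) { return 1 / (1 + Math.exp(-z)); }
    static double gD(double gZ) { return gZ * (1 - gZ); }

    public static void main(String[] args) {
        double z = 0.7;
        double a = g(z); // activation
        double analytic = Math.exp(-z) / Math.pow(1 + Math.exp(-z), 2);
        System.out.println(gD(a) + " == " + analytic); // these two match
        System.out.println("raw input instead: " + gD(z)); // this one does not
    }
}
```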

Unrelated note: your notation of (-1 * x) is really strange; just use -x.

Your implementation, judging from how you phrase the steps of your ANN, seems muddled. Try to focus on implementing a forwardPropagation method, a backPropagation method, and then an updateWeights method; creating a matrix class helps keep those readable.

This is my Java implementation; it's very simple and somewhat rough. I use a Matrix class to make the math behind it appear very simple in code.

If you can code in C++, you can overload operators, which allows for even easier writing of comprehensible code.

https://github.com/josephjaspers/ArtificalNetwork/blob/master/src/artificalnetwork/ArtificalNetwork.java


Here are the algorithms (C++).

All of this code can be found on my GitHub (the neural nets are simple and functional). Each layer includes the bias nodes, which is why there are offsets.

void NeuralNet::forwardPropagation(std::vector<double> data) {
    setBiasPropagation(); // sets every bias node's activation to 1
    a(0).set(1, Matrix(data)); // offset of 1 to skip the bias unit (A = X)

    for (int i = 1; i < layers; ++i) {
        // set(1, ...) offsets past the bias unit
        z(i).set(1, w(i - 1) * a(i - 1));
        a(i) = g(z(i)); // g(z) is the sigmoid function
    }
}

void NeuralNet::setBiasPropagation() {
    for (int i = 0; i < activation.size(); ++i) {
        a(i).set(0, 0, 1);
    }
}

output layer:  d = a - y  (y is the expected output data)

hidden layers: d^l = (w^l(T) * d^(l+1)) *: gD(a^l)

d = delta (error) vector

w = weights matrix (length = connections, width = features)

a = activation matrix

gD = derivative function

^l = NOT a power (it just means "at layer l")

(T) = transpose

* = dot product

*: = element-wise multiply (multiply each element "through")

cpy(n) returns a copy of the matrix offset by n (ignores the first n rows)

void NeuralNet::backwardPropagation(std::vector<double> output) {
    d(layers - 1) = a(layers - 1) - Matrix(output);
    for (int i = layers - 2; i > -1; --i) {
        d(i) = (w(i).T() * d(i + 1).cpy(1)).x(gD(a(i)));
    }
}

Explaining this code may be confusing without images, so I'm sending this link, which I think is a good source; it also contains an explanation of backpropagation which may be better than my own explanation: http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

void NeuralNet::updateWeights() {
    // the operator () (int l, int w) returns a double reference at that position in the matrix
    // the operator [] (int n) returns the nth double (reference) in the matrix (useful for vectors)
    for (int l = 0; l < layers - 1; ++l) {
        for (int i = 1; i < d(l + 1).length(); ++i) {
            for (int j = 0; j < a(l).length(); ++j) {
                w(l)(i - 1, j) -= (d(l + 1)[i] * a(l)[j]) * learningRate + m(l)(i - 1, j);
                m(l)(i - 1, j) = (d(l + 1)[i] * a(l)[j]) * learningRate * momentumRate;
            }
        }
    }
}
Jjoseph
  • Thanks for the answer! However, after implementing g and gD I'm still having a similar problem. Now it converges a LOT faster, but it still either goes to only 1 or only 0 and then shortly after that goes to NaN (for both 0 and 1). – Marshall D Sep 19 '16 at 21:43
  • Hello, sorry for the late response. Regarding your issue of converging at local minima: this actually occurs in all neural networks. For instance, if you set all your weights to the same constant as opposed to random initialization, then when iterating through the data set each weight becomes adjusted very slightly ("moving") in the same direction. This is what is happening when you get stuck in a local minimum. Many neural networks even need to be rerun multiple times until an adequate minimum (whether local or optimal) is found. – Jjoseph Oct 03 '16 at 16:39
  • If it is not optimizing consistently, you can add more nodes to the hidden layer. You can test how this increases the chances of optimizing by comparing a 2-2-1 XOR network to, say, a 2-10-1 (the increase in nodes decreases the chance that the weights become too close to each other). The fact that you're getting a NaN error means that your delta is most likely not being calculated correctly, or perhaps you have a negative sign you either forgot or added incorrectly (once the error gets close to 0, the weights should barely shift). You can attempt to handle this with a rounding method. – Jjoseph Oct 03 '16 at 16:42
  • I tried adding a lot of complexity to the system, and I randomize my weights between -2.5 and 2.5; I still get the problem. I suppose it must be the delta error then. Here are the two equations I use: `-(ActualOutput - Expected)*Sigmoid'(Neuron.input)` for the output neuron, and `Sigmoid'(Neuron.input)*(sum of (Weight * theNeuronItFeedsTo's delta error))` for every other neuron. My sigmoid derivative is `-(e^-x)/(1+e^-x)^2`. Are these wrong? Thank you so much, it means a lot! – Marshall D Oct 05 '16 at 03:31
  • What sources are you using for your neural net? I recommend Coursera's Andrew Ng Machine Learning course (in weeks 4/5 he explains each algorithm of a neural network). I added the algorithms to my response. Your sigmoid derivative is wrong; it's just (a * (1 - a)), where a will already have been set to 1 / (1 + e^-z) – Jjoseph Oct 07 '16 at 19:02