14

I'm training an XOR neural network via back-propagation using stochastic gradient descent. The weights of the neural network are initialized to random values between -0.5 and 0.5. The network trains successfully around 80% of the time. However, sometimes it gets "stuck" while backpropagating. By "stuck", I mean that the rate of error correction starts to fall off. For example, during a successful training run, the total error decreases fairly quickly as the network learns, like so:

...
...
Total error for this training set: 0.0010008071327708653
Total error for this training set: 0.001000750550254843
Total error for this training set: 0.001000693973929822
Total error for this training set: 0.0010006374037948094
Total error for this training set: 0.0010005808398488103
Total error for this training set: 0.0010005242820908169
Total error for this training set: 0.0010004677305198344
Total error for this training set: 0.0010004111851348654
Total error for this training set: 0.0010003546459349181
Total error for this training set: 0.0010002981129189812
Total error for this training set: 0.0010002415860860656
Total error for this training set: 0.0010001850654351723
Total error for this training set: 0.001000128550965301
Total error for this training set: 0.0010000720426754587
Total error for this training set: 0.0010000155405646494
Total error for this training set: 9.99959044631871E-4

Testing trained XOR neural network
0 XOR 0: 0.023956746649767453
0 XOR 1: 0.9736079194769579
1 XOR 0: 0.9735670067093437
1 XOR 1: 0.045068688874314006

However, when it gets stuck, the total error is still decreasing, but only at an ever-slower rate:

...
...
Total error for this training set: 0.12325486644721295
Total error for this training set: 0.12325486642503929
Total error for this training set: 0.12325486640286581
Total error for this training set: 0.12325486638069229
Total error for this training set: 0.12325486635851894
Total error for this training set: 0.12325486633634561
Total error for this training set: 0.1232548663141723
Total error for this training set: 0.12325486629199914
Total error for this training set: 0.12325486626982587
Total error for this training set: 0.1232548662476525
Total error for this training set: 0.12325486622547954
Total error for this training set: 0.12325486620330656
Total error for this training set: 0.12325486618113349
Total error for this training set: 0.12325486615896045
Total error for this training set: 0.12325486613678775
Total error for this training set: 0.12325486611461482
Total error for this training set: 0.1232548660924418
Total error for this training set: 0.12325486607026936
Total error for this training set: 0.12325486604809655
Total error for this training set: 0.12325486602592373
Total error for this training set: 0.12325486600375107
Total error for this training set: 0.12325486598157878
Total error for this training set: 0.12325486595940628
Total error for this training set: 0.1232548659372337
Total error for this training set: 0.12325486591506139
Total error for this training set: 0.12325486589288918
Total error for this training set: 0.12325486587071677
Total error for this training set: 0.12325486584854453

While reading up on neural networks, I came across a discussion of local minima and global minima, and how a neural network doesn't really "know" which minimum it is supposed to be heading towards.

Is my network getting stuck in a local minimum instead of the global minimum?
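For reference, here is a rough Python/NumPy sketch of the kind of setup I'm describing (not my actual code; I use a 3-3-1 network, while this sketch uses 2-2-1 for brevity, and the learning rate and epoch count are just placeholders):

```python
import numpy as np

# Sketch of the setup described above: a 2-2-1 sigmoid network with biases,
# weights initialized uniformly in [-0.5, 0.5], trained on XOR with plain
# stochastic gradient descent (no momentum).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng()
W1 = rng.uniform(-0.5, 0.5, size=(2, 2))   # input -> hidden weights
b1 = rng.uniform(-0.5, 0.5, size=2)        # hidden biases
W2 = rng.uniform(-0.5, 0.5, size=2)        # hidden -> output weights
b2 = rng.uniform(-0.5, 0.5)                # output bias
lr = 0.5                                   # learning rate (placeholder)

for epoch in range(10000):
    total_error = 0.0
    for x, t in zip(X, y):
        # forward pass
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        total_error += 0.5 * (t - o) ** 2

        # backward pass: output and hidden deltas, then weight updates
        delta_o = (o - t) * o * (1 - o)
        delta_h = delta_o * W2 * h * (1 - h)
        W2 -= lr * delta_o * h
        b2 -= lr * delta_o
        W1 -= lr * np.outer(x, delta_h)
        b1 -= lr * delta_h
    if epoch % 500 == 0:
        print("Total error for this training set:", total_error)
```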

Vivin Paliath

4 Answers

7

Yes, neural networks can get stuck in local minima, depending on the error surface. However, this abstract suggests that there are no local minima in the error surface of the XOR problem. I cannot get to the full text, though, so I cannot verify what the authors did to prove this or how it applies to your problem.

There might also be other factors leading to this problem. For example, if you descend very fast into a steep valley and you are using plain first-order gradient descent, you might overshoot to the opposite slope and bounce back and forth all the time. You could also try logging the average change over all weights at each iteration, to test whether you really have a "stuck" network or one that has just run into a limit cycle.
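Something like the following (a Python/NumPy sketch; `weights` stands for whatever collection of weight arrays your network uses) would show whether the weights have essentially stopped moving or are still oscillating:

```python
import numpy as np

def mean_weight_change(weights_before, weights_after):
    """Mean absolute change per weight between two snapshots of the network."""
    total_change = sum(np.abs(a - b).sum()
                       for b, a in zip(weights_before, weights_after))
    total_count = sum(w.size for w in weights_before)
    return total_change / total_count

# Inside the training loop (pseudocode):
#   snapshot = [w.copy() for w in weights]
#   ... run one epoch of backpropagation ...
#   print("mean |delta w|:", mean_weight_change(snapshot, weights))
#
# A value shrinking towards zero suggests a genuinely stuck network (flat
# region or minimum); a value that stays large while the error barely
# improves suggests overshooting or a limit cycle.
```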

You should first try fiddling with your parameters (learning rate, momentum if you have implemented it, etc.). If you can make the problem go away by changing parameters, your algorithm is probably OK.

LiKao
  • Thanks! Your answer made it much clearer. It appears that neural networks are not exact and that some amount of fuzziness is involved. I will try changing the parameters around and try to make the problem go away. – Vivin Paliath Nov 08 '11 at 13:49
  • I ran across [this paper](http://www.ncbi.nlm.nih.gov/pubmed/18252598) (published in '99, a year after the one you cited) that says that there is a local minimum for the 2-3-1 XOR network (I'm using a 3-3-1 XOR network; not sure if the bias on the input layer is necessary). Again, just like in your case, this is only an abstract. – Vivin Paliath Nov 08 '11 at 17:13
  • I also saw [this paper](http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CDMQFjAC&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.31.4770%26rep%3Drep1%26type%3Dpdf&ei=-WK5TqjAGIaviAL636jTBA&usg=AFQjCNEaQ0jG2bkD4ipXcfgXDr9mHrxRMQ&sig2=BD8IyRc8Clg2XftdR20W9w) that says that there are no local minima for the simplest XOR network, but that doesn't seem to be a 2-3-1 or a 3-3-1 network. – Vivin Paliath Nov 08 '11 at 17:15
4

Poor gradient descent with excessively large steps, as described by LiKao, is one possible problem. Another is that there are very flat regions of the XOR error landscape, which means it can take a very long time to converge; in fact, the gradient may be so weak that the descent algorithm doesn't pull you in the right direction.

These two papers (linked below) look at 2-1-1 and 2-2-1 XOR landscapes. One uses a "cross entropy" error function which I am not familiar with. In the first they declare there are no local minima, but in the second they say there are local minima at infinity - basically, when weights run off to very large values. So for the second case, their results suggest that if you don't start off "near enough" to a true minimum, you may get trapped at the points at infinity. They also say that other analyses of 2-2-1 XOR networks that show no local minima are not contradicted by their results, because of differing definitions.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4770

http://www.ncbi.nlm.nih.gov/pubmed/12662806
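For reference, the "cross entropy" error mentioned above appears to be the standard binary one for a single sigmoid output; a quick sketch (the variable names here are just placeholders):

```python
import numpy as np

def cross_entropy(o, t, eps=1e-12):
    """Binary cross-entropy for a single sigmoid output o and target t in {0, 1}."""
    o = np.clip(o, eps, 1 - eps)   # avoid log(0)
    return -(t * np.log(o) + (1 - t) * np.log(1 - o))

def squared_error(o, t):
    """The usual summed-squared-error term, for comparison."""
    return 0.5 * (t - o) ** 2

# With a sigmoid output unit, cross-entropy has the convenient property that
# the output delta reduces to (o - t): it lacks the o*(1-o) factor that makes
# squared error learn very slowly when the output unit saturates.
```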

2

I encountered the same issue and found that using the activation function 1.7159*tanh(2/3*x) described in LeCun's "Efficient Backprop" paper helps. This is presumably because that function does not saturate around the target values {-1, 1}, whereas regular tanh does.
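A minimal sketch of that activation and its derivative, assuming targets of -1 and 1 rather than 0 and 1:

```python
import numpy as np

# Scaled tanh from LeCun's "Efficient BackProp": f(x) = 1.7159 * tanh(2/3 * x).
# With this scaling f(1) is roughly 1 and f(-1) is roughly -1, so targets of
# +/-1 land on the near-linear part of the curve instead of the saturated tails.

def scaled_tanh(x):
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def scaled_tanh_prime(x):
    t = np.tanh(2.0 / 3.0 * x)
    return 1.7159 * (2.0 / 3.0) * (1.0 - t * t)
```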

1

The paper by Hamey cited in @LiKao's answer proves there are no strict "regional local minima" for XOR in a 2-2-1 neural network. However, it admits "asymptotic minima" wherein the error surface flattens out as one or more weights approach infinity.

In practice, the weights don't even need to be so large for this to happen, and it is quite common for a 2-2-1 net to get stuck in this flat asymptotic region. The reason is saturation: the gradient of the sigmoid activation approaches 0 as the weights grow large, so the network is unable to keep learning.
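You can see the saturation effect numerically; a quick standalone sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative o*(1-o) collapses as the pre-activation grows,
# so weight updates (which are proportional to it) become negligible.
for z in [1.0, 5.0, 10.0, 20.0]:
    o = sigmoid(z)
    print(f"z = {z:5.1f}  ->  o*(1-o) = {o * (1 - o):.2e}")
# z =   1.0  ->  o*(1-o) = 1.97e-01
# z =   5.0  ->  o*(1-o) = 6.65e-03
# z =  10.0  ->  o*(1-o) = 4.54e-05
# z =  20.0  ->  o*(1-o) = 2.06e-09
```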

See my notebook experiment - typically around 2 or 3 out of 10 networks end up stuck, even after 10,000 epochs. The results differ slightly if you change the learning rate, batch size, activation or loss function, initial weights, whether inputs are presented in random or fixed order, etc., but usually a network gets stuck now and then.

Burrito