
I'm a graduate student, new to Keras and neural networks, and I'm trying to fit a very simple feedforward neural network to a one-dimensional sine.

Below are three examples of the best fit that I can get. On the plots, you can see the output of the network vs ground truth.

[Plots: neural network output vs ground truth, runs 1, 2 and 3]

The complete code, just a few lines, is posted here: example Keras


I played with the number of layers, different activation functions, different initializations, different loss functions, the batch size, and the number of training samples. None of those improved the results beyond the examples above.

I would appreciate any comments and suggestions. Is sine a hard function for a neural network to fit? I suspect the answer is no, so I must be doing something wrong...


There is a similar question here from 5 years ago, but the OP there didn't provide the code and it is still not clear what went wrong or how he was able to resolve this problem.

them
  • How "deep" does your network go? ANN's are usually excellent at non-linear problems, if and only if, the depth and width of the network is great enough. Two or three layers are excellent but single layers are dumb. Also, you might not have enough "resolution" in a given layer which is processing the signal (or width) – Patrick Sturm Oct 28 '17 at 22:47
  • @PatrickSturm I tried anything between 3 to 10 layers with 10 or more neurons in each layer. Above say ~4 layers the difference was small. – them Oct 28 '17 at 22:49
  • That is definitely enough so I'm stumped. May I ask how many training examples it was given? – Patrick Sturm Oct 28 '17 at 22:50
  • @PatrickSturm 1000 to 10000, no difference... The code is here https://gist.github.com/anonymous/c816eb15daf949543e96dd2c64174670 – them Oct 28 '17 at 22:51
  • The only "idea" I have left is, perhaps the training data, by mistake, never actually reaches a negative value, a "quick sim" almost suggests such a circumstance but without thorough testing, is difficult to say for certain – Patrick Sturm Oct 28 '17 at 22:56
  • It is "true", that with the algorithm\r\n T = 1000 X = numpy.array(range(T)) Y = numpy.sin(3.5 * numpy.pi * X / T), I can never reach a negative value – Patrick Sturm Oct 28 '17 at 22:58
  • Try to normalize input values, e.g. from -1 to 1. Also scale down the output values, as the tanh has difficulties with values close to +/-1 – BlackBear Oct 28 '17 at 22:59
  • @PatrickSturm +BlackBear. Thanks for your suggestions, I tried to limit the amplitude of the sine function, and I just added a plot where the well-fitted part is actually negative so I don't think it is related to "sign" or the amplitude (as long as it is in the range of the final output neuron). I tried 'ReLU' and other activations. – them Oct 28 '17 at 23:07

2 Answers


In order to make your code work, you need to:

  • scale the input values in the [-1, +1] range (neural networks don't like big values)
  • scale the output values as well, as the tanh activation doesn't work too well close to +/-1
  • use the relu activation instead of tanh in all but the last layer (converges way faster)

With these modifications, I was able to run your code with two hidden layers of 10 and 25 neurons.
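A minimal sketch of what such a script could look like (the two hidden layers of 10 and 25 neurons follow this answer; the exact scaling constants, optimizer, epoch count and batch size below are assumptions, not part of the original answer):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Same target as in the question's gist
T = 1000
X = np.arange(T)
Y = np.sin(3.5 * np.pi * X / T)

# 1) scale inputs into [-1, +1]
X_scaled = (2.0 * X / (T - 1) - 1.0).reshape(-1, 1)
# 2) scale outputs away from +/-1 so tanh does not have to saturate
Y_scaled = (0.8 * Y).reshape(-1, 1)

# 3) relu in the hidden layers, tanh only at the output
model = Sequential()
model.add(Dense(10, input_dim=1, activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(1, activation='tanh'))
model.compile(loss='mse', optimizer='adam')

model.fit(X_scaled, Y_scaled, epochs=500, batch_size=32, verbose=0)
prediction = model.predict(X_scaled) / 0.8   # undo the output scaling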

BlackBear
  • thanks, I tried your suggestions but this does not help. – them Oct 28 '17 at 23:11
  • @them well I started from your code.. I don't know how else to help – BlackBear Oct 28 '17 at 23:12
  • Thanks for your suggestion! So were you able to approximate the sine starting from my code with the suggested changes? I tried what you suggested but it didn't help... – them Oct 28 '17 at 23:13
  • Can you please post your code? I'd appreciate it very much! – them Oct 28 '17 at 23:17
  • Thanks! The change that made all the difference was the line where I set the range of X to [-1, 1]: 'X = numpy.arange(-1, 1, 0.001)'. – them Oct 28 '17 at 23:22

Since there is already an answer that provides a workaround, I'm going to focus on problems with your approach.

Input data scale

As others have stated, your input data values range from 0 to 1000, which is quite big. This problem can easily be solved by scaling your input data to zero mean and unit variance (X = (X - X.mean())/X.std()), which will result in improved training performance. For tanh this improvement can be explained by saturation: tanh maps to [-1, 1] and will therefore return either -1 or 1 for almost all sufficiently big (> 3) x, i.e. it saturates. In saturation the gradient of tanh is close to zero and nothing is learned. Of course, you could also use ReLU instead, which won't saturate for values > 0; however, you would then have a similar problem, as the gradients depend (almost) solely on x, and therefore later inputs will always have a higher impact than earlier inputs (among other things).
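As a small illustration (the numbers are just an example, not taken from the question's data), z-score scaling and the saturation effect look like this:

import numpy as np

# Raw inputs as in the question: 0 ... 999
X = np.arange(1000, dtype=float)

# Zero mean, unit variance
X_scaled = (X - X.mean()) / X.std()

# Unscaled inputs push tanh straight into saturation, where its gradient is ~0
print(np.tanh(X[-1]))                                     # 1.0 (saturated)
print(np.tanh(X_scaled).min(), np.tanh(X_scaled).max())   # about -0.94 ... 0.94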

While re-scaling or normalization may be a solution, another solution would be to treat your input as a categorical input and map your discrete values to a one-hot encoded vector, so instead of

>>> X = np.arange(T)
>>> X.shape
(1000,)

you would have

>>> X = np.eye(len(X))
>>> X.shape
(1000, 1000)

Of course this might not be desirable if you want to learn continuous inputs.
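If you did want to try the categorical route, a hypothetical sketch of feeding the one-hot input into a Dense network could look like this (the layer sizes, optimizer and epoch count are my assumptions, not part of this answer):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

T = 1000
X = np.eye(T)                                              # one-hot encoded time indices
Y = np.sin(3.5 * np.pi * np.arange(T) / T).reshape(-1, 1)

model = Sequential()
model.add(Dense(25, input_dim=T, activation='relu'))
model.add(Dense(1, activation='tanh'))
model.compile(loss='mse', optimizer='adam')
model.fit(X, Y, epochs=100, batch_size=32, verbose=0)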

Modeling

You are currently trying to model a mapping from a linear function to a non-linear function: you map f(x) = x to g(x) = sin(x). While I understand that this is a toy problem, this way of modeling is limited to this one curve, as f(x) is in no way related to g(x). As soon as you try to model different curves, say both sin(x) and cos(x), with the same network, you will have a problem with your X, as it has exactly the same values for both curves. A better approach to modeling this problem is to predict the next value of the curve, i.e. instead of

X = range(T)
Y = sin(X)

you want

S = sin(X)
X = S[:-1]
Y = S[1:]

so for time-step 2 you will get the y value of time-step 1 as input and your loss expects the y value of time-step 2. This way you implicitly model time.
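A small sketch of how those shifted input/target pairs could be built (the 3.5 * pi factor mirrors the question's gist; the reshape to a column vector is only so that Keras accepts the arrays):

import numpy as np

T = 1000
t = np.arange(T)
signal = np.sin(3.5 * np.pi * t / T)

# X[i] is the curve value at step i, Y[i] is the value at step i + 1
X = signal[:-1].reshape(-1, 1)
Y = signal[1:].reshape(-1, 1)

print(X.shape, Y.shape)   # (999, 1) (999, 1)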

nemo
  • Thanks for your explanation! I think I understood all your comments, except that I didn't entirely understand why the rescaling of inputs is always to zero mean and unit variance. Since the input is multiplied by the weights in the first layer, shouldn't the rescaling match the random initialization of the weights? – them Oct 29 '17 at 02:30
  • Re-scaling using the [z-score](https://en.wikipedia.org/wiki/Standard_score) is only one way to achieve an input range close to [-1, 1]; there are other techniques, e.g. min-max scaling, which may work better depending on the nature of your data (a hard maximum is better enforced with min-max scaling). It mostly depends on your *data*, and your weights will adapt (to a certain degree) as long as your input data is not distorted. So it is much more important to choose a scaling method that does not introduce perturbations than to fiddle with initialization. – nemo Oct 29 '17 at 10:56
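For completeness, a short sketch contrasting the two scalings mentioned in this comment (the array is only an example):

import numpy as np

X = np.arange(1000, dtype=float)

# Min-max scaling to [-1, 1]: enforces hard bounds on the scaled values
X_minmax = 2 * (X - X.min()) / (X.max() - X.min()) - 1

# z-score scaling: zero mean, unit variance, but no hard bounds
X_zscore = (X - X.mean()) / X.std()

print(X_minmax.min(), X_minmax.max())   # -1.0 1.0
print(X_zscore.min(), X_zscore.max())   # about -1.73 1.73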