Since there is already an answer that provides a workaround, I'm going to focus on problems with your approach.
Input data scale
As others have stated, your input data range of 0 to 1000 is quite big. This problem can be easily solved by scaling your input data to zero mean and unit variance (X = (X - X.mean()) / X.std()), which will result in improved training performance. For tanh this improvement can be explained by saturation: tanh maps to [-1, 1] and will therefore return either -1 or 1 for almost all sufficiently big (> 3) x, i.e. it saturates. In saturation the gradient of tanh will be close to zero and nothing will be learned. Of course, you could also use ReLU instead, which won't saturate for values > 0; however, you would then have a similar problem, as gradients would depend (almost) solely on x, and therefore later inputs would always have a higher impact than earlier inputs (among other things).
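To make the saturation argument concrete, here is a minimal sketch (assuming NumPy; the 0.999 threshold is just an arbitrary cut-off for "saturated") comparing tanh on the raw and the standardized inputs:

import numpy as np

X = np.arange(1000, dtype=np.float64)       # raw inputs from 0 to 999

# On the raw scale almost every input lands in the flat tails of tanh,
# where the derivative 1 - tanh(x)**2 is effectively zero.
frac_raw = (np.abs(np.tanh(X)) > 0.999).mean()
print(frac_raw)                             # ~1.0: nearly all inputs saturated

# After scaling to zero mean and unit variance the inputs sit in the
# near-linear region of tanh, so gradients remain informative.
X_std = (X - X.mean()) / X.std()
frac_std = (np.abs(np.tanh(X_std)) > 0.999).mean()
print(frac_std)                             # 0.0: no inputs saturated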
While re-scaling or normalization may be a solution, another option is to treat your input as categorical and map each discrete value to a one-hot encoded vector, so instead of
>>> import numpy as np
>>> T = 1000
>>> X = np.arange(T)
>>> X.shape
(1000,)
you would have
>>> X = np.eye(len(X))
>>> X.shape
(1000, 1000)
Of course, this might not be desirable if you want to model continuous inputs.
Modeling
You are currently trying to model a mapping from a linear function to a non-linear function: you map f(x) = x to g(x) = sin(x). While I understand that this is a toy problem, this way of modeling is limited to this one curve, as f(x) is in no way related to g(x). As soon as you try to model different curves, say both sin(x) and cos(x), with the same network, you will have a problem with your X, as it has exactly the same values for both curves. A better approach to modeling this problem is to predict the next value of the curve, i.e. instead of
X = np.arange(T)
Y = np.sin(X)
you want
S = np.sin(np.arange(T))  # compute the full curve once
X = S[:-1]                # input:  value at time-step t
Y = S[1:]                 # target: value at time-step t + 1
so for time-step 2 you will get the y value of time-step 1 as input, and your loss expects the y value of time-step 2. This way you implicitly model time.
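As a self-contained sketch of this setup (assuming NumPy and T = 1000 as above; the loop over sin and cos is only to show that the same construction works for several curves):

import numpy as np

T = 1000
t = np.arange(T)

for curve in (np.sin, np.cos):
    S = curve(t)      # the curve sampled at each time-step
    X = S[:-1]        # input:  y value at time-step t
    Y = S[1:]         # target: y value at time-step t + 1
    # Each pair (X[i], Y[i]) teaches the network "given the current
    # value of the curve, predict the next one".
    print(curve.__name__, X.shape, Y.shape)   # -> sin (999,) (999,) etc.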