
I made a simple model that should figure out the relationship between input and output numbers, in this case x and x squared. The code, in Python:

import numpy as np
import tensorflow as tf

# Have TensorFlow log only error messages.
tf.logging.set_verbosity(tf.logging.ERROR)

features = np.array([-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10], dtype = float)
labels = np.array([100, 81, 64, 49, 36, 25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25, 36, 49, 64,
                    81, 100], dtype = float)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(units = 1, input_shape = [1])
])

model.compile(loss = "mean_squared_error", optimizer = tf.keras.optimizers.Adam(0.0001))
model.fit(features, labels, epochs = 50000, verbose = False)
print(model.predict([4, 11, 20]))

I tried different numbers of units, adding more layers, and even using the relu activation function, but the results were always wrong. It works for other relationships like x and 2x. What is the problem here?

Ameer Taweel

4 Answers


You are making two very basic mistakes:

  • Your ultra-simple model (a single-layer network with a single unit) hardly qualifies as a neural network at all, let alone a "deep learning" one (as your question is tagged)
  • Similarly, your dataset (just 21 samples) is also ultra-small

It is well understood that neural networks need to be of some complexity if they are to solve problems even as "simple" as x*x, and they really shine when fed with large training datasets.

The methodology for such function-approximation problems is not to simply list the (few possible) inputs and feed them to the model along with the desired outputs; remember, NNs learn through examples, not through symbolic reasoning, and the more examples the better. What we usually do in such cases is generate a large number of examples, which we then feed to the model for training.

Having said that, here is a rather simple demonstration of a 3-layer neural network in Keras for approximating the function x*x, using as input 10,000 random numbers generated in [-50, 50]:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras import regularizers
import matplotlib.pyplot as plt

model = Sequential()
model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape = (1,)))
model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
model.add(Dense(1))

model.compile(optimizer=Adam(), loss='mse')

# generate 10,000 random numbers in [-50, 50], along with their squares
x = np.random.random((10000,1))*100-50
y = x**2

# fit the model, keeping 2,000 samples as validation set
hist = model.fit(x, y, validation_split=0.2,
                 epochs=15000,
                 batch_size=256)

# check some predictions:
print(model.predict([4, -4, 11, 20, 8, -5]))
# result:
[[ 16.633354]
 [ 15.031291]
 [121.26833 ]
 [397.78638 ]
 [ 65.70035 ]
 [ 27.040245]]

Well, not that bad! Remember that NNs are function approximators: we should expect them neither to exactly reproduce the functional relationship nor to "know" that the results for 4 and -4 should be identical.
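As a quick sanity check (a small sketch of my own, assuming the `model` trained in the code block above), you can compare these predictions against the true squares and look at the relative error:

import numpy as np

# sanity-check sketch: compare predictions with the true squares,
# assuming `model` is the network trained above
test_x = np.array([4, -4, 11, 20, 8, -5], dtype=float).reshape(-1, 1)
preds = model.predict(test_x).ravel()
true = test_x.ravel() ** 2
rel_err = np.abs(preds - true) / true
for xi, pi, ti, ei in zip(test_x.ravel(), preds, true, rel_err):
    print(f"x = {xi:6.1f}   prediction = {pi:9.3f}   true = {ti:7.1f}   rel. error = {ei:.1%}")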

Let's generate some new random data in [-50,50] (remember, for all practical purposes, these are unseen data for the model) and plot them, along with the original ones, to get a more general picture:

plt.figure(figsize=(14,5))
plt.subplot(1,2,1)
p = np.random.random((1000,1))*100-50 # new random data in [-50, 50]
plt.plot(p,model.predict(p), '.')
plt.xlabel('x')
plt.ylabel('prediction')
plt.title('Predictions on NEW data in [-50,50]')

plt.subplot(1,2,2)
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x,y,'.')
plt.title('Original data')
plt.show()

Result:

[two plots: model predictions on new data in [-50, 50] alongside the original training data]

Well, it arguably does look like a good approximation indeed...

You could also take a look at this thread for a sine approximation.

The last thing to keep in mind is that, although we did get a decent approximation even with our relatively simple model, what we should not expect is extrapolation, i.e. good performance outside [-50, 50]; for details, see my answer in Is deep learning bad at fitting simple non linear functions outside training scope?
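To see this lack of extrapolation for yourself, you can query the trained model well outside the training interval (a quick sketch, assuming the `model` from the code above):

import numpy as np

# hypothetical extrapolation check: points far outside the training range [-50, 50]
far_x = np.array([60.0, 80.0, 100.0, 150.0]).reshape(-1, 1)
print(np.hstack([far_x ** 2, model.predict(far_x)]))  # true squares vs. model output
# the predictions typically keep growing far too slowly (roughly linearly)
# instead of quadratically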

desertnaut
  • So you mean that to get more accurate results on numbers outside `[-50, 50]`, I should train it on examples from the range I need? – Ameer Taweel Mar 17 '19 at 14:03
  • @AmeerTaweel indeed – desertnaut Mar 17 '19 at 14:56
  • But how did you know an 8/8/1 NN would do the trick? – Tony Ennis Oct 11 '21 at 00:33
  • @TonyEnnis I didn't (and the question was not actually about this). I just knew that a single-layer, single-unit model like the OP's would not do the job; the rest was simple trial & error, based on experience (I can't remember now, but I think I got it on the first try). – desertnaut Oct 11 '21 at 07:37

The problem is that x*x is a very different beast than a*x.

Note what a usual "neural network" does: it stacks layers of the form y = f(W*x + b) a few times, never multiplying x with itself. Therefore, you'll never get a perfect reconstruction of x*x, unless you set f(x) = x*x or similar.
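To make this concrete, here is a small sketch (my illustration, in PyTorch like the code further below): with ReLU activations the network is piecewise linear in x, so its second finite difference is zero away from the "kinks", whereas x*x has a constant, nonzero second difference everywhere:

import torch
import torch.nn as nn

# a ReLU network (even an untrained one) is piecewise linear in x
net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.linspace(-3, 3, 601).view(-1, 1)   # fine grid with step h = 0.01
y_net = net(x).detach().view(-1)
y_sq = (x * x).view(-1)

def second_diff(y):                          # y[i-1] - 2*y[i] + y[i+1]
    return y[:-2] - 2 * y[1:-1] + y[2:]

flat = (second_diff(y_net).abs() < 1e-4).float().mean().item()
print(f"network: ~zero curvature at {flat:.0%} of grid points")   # close to 100%
print(f"x*x: second difference ~{second_diff(y_sq)[0].item():.6f} everywhere (= 2*h**2)")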

What you can get is an approximation within the range of values presented during training (and perhaps a very little bit of extrapolation). Anyway, I'd recommend working with a smaller range of values; it will make the problem easier to optimize.

And on a philosophical note: In machine learning, I find it more useful to think of good/bad, rather than correct/wrong. Especially with regression, you cannot get the result "right" unless you have the exact model. In which case there is nothing to learn.


There actually are some NN architectures that multiply f(x) with g(x), most notably LSTMs and Highway networks. But even these have one or both of f(x), g(x) bounded (by logistic sigmoid or tanh), and thus are unable to model x*x fully.


Since there is some misunderstanding expressed in comments, let me emphasize a few points:

  1. You can approximate your data.
  2. To do well in any sense, you do need a hidden layer.
  3. But no more data is necessary, though if you cover the space more densely, the model will fit more closely; see desertnaut's answer.

As an example, here is a result from a model with a single hidden layer of 10 units with tanh activation, trained by SGD with learning rate 1e-3 for 15k iterations to minimize the MSE of your data. Best of five runs:

[plot: performance of a simple NN trained on the OP's data]

Here is the full code to reproduce the result. Unfortunately, I cannot install Keras/TF in my current environment, but I hope that the PyTorch code is accessible :-)

#!/usr/bin/env python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# the OP's data: integers in [-10, 10] and their squares
X = torch.arange(-10, 11, dtype=torch.float32).view(-1, 1)
Y = X * X

model = nn.Sequential(
    nn.Linear(1, 10),
    nn.Tanh(),
    nn.Linear(10, 1)
)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_func = nn.MSELoss()
for _ in range(15000):
    optimizer.zero_grad()
    pred = model(X)
    loss = loss_func(pred, Y)
    loss.backward()
    optimizer.step()

x = torch.linspace(-12, 12, steps=200).view(-1, 1)
y = model(x)
f = x*x

plt.plot(x.detach().view(-1).numpy(), y.detach().view(-1).numpy(), 'r.', linestyle='None')
plt.plot(x.detach().view(-1).numpy(), f.detach().view(-1).numpy(), 'b')
plt.show()
dedObed
  • But since the neural network will never multiply `x` by itself, this method won't really solve the problem. How can I make it multiply `x` by itself **without explicitly** telling it to do so? It won't be `Machine Learning` if I tell it what to do. – Ameer Taweel Mar 14 '19 at 19:49
  • Well... this is not a problem to solve with ML in the first place ;-) If learning, play around with fitting data with different amounts of noise. Plot the predictions and watch how different models can fit in different ranges of data. Overall: despite all the recent hype, the so-called neural networks are just parametrized functions of the input. So you do give them some structure in any case. If there is no multiplication between inputs, inputs will never be multiplied. If you know/suspect that your task needs them to be multiplied, tell the network to do so. – dedObed Mar 14 '19 at 20:07

I'd like to add to desertnaut's answer and dedObed's answer just because I got an interesting result.

The result comes from running desertnaut's exact code, training between -50 and +50 but testing between -70 and +70:

[plot: predictions of the ReLU model on [-70, 70], extrapolating roughly linearly outside the training range]

What's interesting is that, outside the training range, the network extrapolates with a slope close to the gradient of x**2 at the edge of the training data: roughly 90.5 on the right, versus the true 2*x = 100 at x = 50 (and similarly about -100 on the left).

The reason this is different from dedObed's answer is probably that they are using tanh activation, rather than relu.

As tanh squashes everything to between -1 and +1, values outside the training range get mapped close to -1 or +1 before the final linear layer, so the output for an input of 12 ends up very similar to the output for an input of 11 (given that model's training range).

ReLU, on the other hand, just sets negative values to zero, and the network seems to be approximating the gradient of the function in a local region near the boundary, hence the behaviour in the extrapolated region.
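A quick way to check that numerically (a sketch, assuming the trained ReLU `model` produced by the code below): estimate the slope of the predictions just beyond the training range and compare it with the true derivative 2*x at the boundary:

import numpy as np

# hypothetical slope check just beyond the right edge of the training range [-50, 50]
x_right = np.array([[55.0], [65.0]])
p_right = model.predict(x_right).ravel()
slope_right = (p_right[1] - p_right[0]) / 10.0
print(f"extrapolation slope on the right: {slope_right:.1f}  (true d/dx x**2 at x = 50 is 100)")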

For the sake of completeness, here is the code (credit desertnaut):

# imports assumed from desertnaut's answer above
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras import regularizers


def train_keras_model(x, y):
    model = Sequential()
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=(1,)))
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
    model.add(Dense(1))

    model.compile(optimizer=Adam(), loss='mse')

    # fit the model, keeping 2,000 samples as validation set
    hist = model.fit(x, y,
                     validation_split=0.2,
                     epochs=15000,
                     batch_size=256)
    model.save("Regressor_model")
    return model

def main():
    low = -50
    high = 50
    x, y = generated_x_squared_data(low, high, 10000)
    model = train_keras_model(x, y)
    x_test, y_test = generated_x_squared_data(-70, 70, 10000)

    y_pred = model.predict(x_test)

    create_frame(0, x_test, y_test, y_pred, low, high, 15000, out_file="x_squared_keras.png")
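The helpers `generated_x_squared_data` and `create_frame` are not shown in the answer; here is a minimal sketch of what they could look like, with the behaviour and a simplified plotting signature assumed from how they are called above:

import numpy as np
import matplotlib.pyplot as plt

def generated_x_squared_data(low, high, n):
    # assumed behaviour: n uniformly random points in [low, high] and their squares
    x = np.random.uniform(low, high, size=(n, 1))
    return x, x ** 2

# a simplified stand-in for the author's create_frame plotting helper
def plot_predictions(x_test, y_test, y_pred, low, high, out_file="x_squared_keras.png"):
    plt.plot(x_test, y_test, '.', markersize=2, label='true x**2')
    plt.plot(x_test, y_pred, '.', markersize=2, label='model prediction')
    plt.axvline(low, color='k', linestyle='--')   # edges of the training range
    plt.axvline(high, color='k', linestyle='--')
    plt.xlabel('x')
    plt.legend()
    plt.savefig(out_file)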
Titus Buckworth

My answer is a bit different. For the trivial case x*x, you can just write your own activation function that takes in x and outputs x*x. This answers the question above, "how to build a NN that calculates x*x?". But it may violate the "spirit" of the question.

I mention this because sometimes you want to perform a non-trivial operation like x --> exp[A * x*x] * sinh[1/sqrt(log(k * x))]. You could write an activation function for this, but the backpropagation code would be hellish and impenetrable to another developer.

And suppose you also want the function x --> exp[A * x*x] * cosh[1/sqrt(log(k * x))]. Writing another stand-alone activation function would just be wasteful.

For this reason, you might want to build a library of activation functions from atomic operations like z*z, exp(z), sinh(z), cosh(z), sqrt(z), and log(z). These activation functions would be applied one at a time, with the help of auxiliary network layers consisting of passthrough (i.e. no-op) nodes.
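For the trivial x*x case specifically, one way to implement such a "square" activation in Keras is with a Lambda layer built from differentiable ops, so the gradient comes from autodiff and no hand-written backpropagation is needed. A minimal sketch (my own illustration with assumed hyperparameters, not code from this answer):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Lambda

# a network that is explicitly allowed to square its (scalar) pre-activation
model = Sequential()
model.add(Dense(1, input_shape=(1,)))   # learns roughly z = w*x + b
model.add(Lambda(lambda z: z * z))      # custom "activation": z -> z*z
model.add(Dense(1))                     # rescales/shifts the squared value

model.compile(optimizer='adam', loss='mse')

x = np.linspace(-10, 10, 2000).reshape(-1, 1)
model.fit(x, x ** 2, epochs=200, batch_size=64, verbose=0)
print(model.predict(np.array([[4.0], [-4.0], [11.0]])))
# ideally close to 16, 16, 121 (depends on the random initialisation and training budget)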

Gerry Harp