Why deep NN can't approximate simple ln(x) function?

Question

I have created ANN with two RELU hidden layers + linear activation layer and trying to approximate simple ln(x) function. And I am can't do this good. I am confused because lx(x) in x:[0.0-1.0] range should be approximated without problems (I am using learning rate 0.01 and basic grad descent optimization).

import tensorflow as tf
import numpy as np

def GetTargetResult(x):
    curY = np.log(x)
    return curY

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)

    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Parameters
learning_rate = 0.01
training_epochs = 10000
batch_size = 50
display_step = 500

# Network Parameters
n_hidden_1 = 50 # 1st layer number of features
n_hidden_2 = 10 # 2nd layer number of features
n_input =  1


# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_uniform([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_uniform([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_uniform([n_hidden_2, 1]))
}
biases = {
    'b1': tf.Variable(tf.random_uniform([n_hidden_1])),
    'b2': tf.Variable(tf.random_uniform([n_hidden_2])),
    'out': tf.Variable(tf.random_uniform([1]))
}

x_data = tf.placeholder(tf.float32, [None, 1])
y_data = tf.placeholder(tf.float32, [None, 1])

# Construct model
pred = multilayer_perceptron(x_data, weights, biases)

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train = optimizer.minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.initialize_all_variables ()

# Launch the graph.
sess = tf.Session()
sess.run(init)

for step in range(training_epochs):
    x_in = np.random.rand(batch_size, 1).astype(np.float32)
    y_in = GetTargetResult(x_in)
    sess.run(train, feed_dict = {x_data: x_in, y_data: y_in})
    if(step % display_step == 0):
        curX = np.random.rand(1, 1).astype(np.float32)
        curY =  GetTargetResult(curX)

        curPrediction = sess.run(pred, feed_dict={x_data: curX})
        curLoss = sess.run(loss, feed_dict={x_data: curX, y_data: curY})
        print("For x = {0} and target y = {1} prediction was y = {2} and squared loss was = {3}".format(curX, curY,curPrediction, curLoss))

For the configuration above NN is just learning to guess y = -1.00. I have tried different learning rates, couple optimizers and different configurations with no success - learning does not converge in any case. I did something like that with logarithm in past in other deep learning framework without problem.. Can be the TF specific issue? What am I doing wrong?

Isn't ln(0.0) euqal to negative infinity? – Aaron Jan 10 '17 at 02:48 — Aaron, Jan 10 '17 at 02:48

Martin Thoma · Accepted Answer · 2017-01-10T10:26:04.107

What your network has to predict

Source: WolframAlpha

What your architecture is

ReLU(ReLU(x * W_1 + b_1) * W_2 + b_2)*W_out + b_out

Thoughts

My first thought was that ReLU is the problem. However, you don't apply relu to the output, so that should not cause the problem.

Changing the initialization (from uniform to normal) and the Optimizer (from SGD to ADAM) seems to fix the problem:

#!/usr/bin/env python
import tensorflow as tf
import numpy as np


def get_target_result(x):
    return np.log(x)


def multilayer_perceptron(x, weights, biases):
    """Create model."""
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)

    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Parameters
learning_rate = 0.01
training_epochs = 10**6
batch_size = 500
display_step = 500

# Network Parameters
n_hidden_1 = 50  # 1st layer number of features
n_hidden_2 = 10  # 2nd layer number of features
n_input = 1


# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1)),
    'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1)),
    'out': tf.Variable(tf.truncated_normal([n_hidden_2, 1], stddev=0.1))
}

biases = {
    'b1': tf.Variable(tf.constant(0.1, shape=[n_hidden_1])),
    'b2': tf.Variable(tf.constant(0.1, shape=[n_hidden_2])),
    'out': tf.Variable(tf.constant(0.1, shape=[1]))
}

x_data = tf.placeholder(tf.float32, [None, 1])
y_data = tf.placeholder(tf.float32, [None, 1])

# Construct model
pred = multilayer_perceptron(x_data, weights, biases)

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(pred - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# train = optimizer.minimize(loss)
train = tf.train.AdamOptimizer(1e-4).minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

for step in range(training_epochs):
    x_in = np.random.rand(batch_size, 1).astype(np.float32)
    y_in = get_target_result(x_in)
    sess.run(train, feed_dict={x_data: x_in, y_data: y_in})
    if(step % display_step == 0):
        curX = np.random.rand(1, 1).astype(np.float32)
        curY = get_target_result(curX)

        curPrediction = sess.run(pred, feed_dict={x_data: curX})
        curLoss = sess.run(loss, feed_dict={x_data: curX, y_data: curY})
        print(("For x = {0} and target y = {1} prediction was y = {2} and "
               "squared loss was = {3}").format(curX, curY,
                                                curPrediction, curLoss))

Training this for 1 minute gave me:

For x = [[ 0.19118255]] and target y = [[-1.65452647]] prediction was y = [[-1.65021849]] and squared loss was = 1.85587377928e-05
For x = [[ 0.17362741]] and target y = [[-1.75084364]] prediction was y = [[-1.74087048]] and squared loss was = 9.94640868157e-05
For x = [[ 0.60853624]] and target y = [[-0.4966988]] prediction was y = [[-0.49964082]] and squared loss was = 8.65551464813e-06
For x = [[ 0.33864763]] and target y = [[-1.08279514]] prediction was y = [[-1.08586168]] and squared loss was = 9.4036658993e-06
For x = [[ 0.79126364]] and target y = [[-0.23412406]] prediction was y = [[-0.24541236]] and squared loss was = 0.000127425722894
For x = [[ 0.09994856]] and target y = [[-2.30309963]] prediction was y = [[-2.29796076]] and squared loss was = 2.6408026315e-05
For x = [[ 0.31053194]] and target y = [[-1.16946852]] prediction was y = [[-1.17038012]] and squared loss was = 8.31002580526e-07
For x = [[ 0.0512077]] and target y = [[-2.97186542]] prediction was y = [[-2.96796203]] and squared loss was = 1.52364455062e-05
For x = [[ 0.120253]] and target y = [[-2.11815739]] prediction was y = [[-2.12729549]] and squared loss was = 8.35050013848e-05

So the answer might be that your optimizer is not good / the optimization problem starts at a bad point. See

Xavier Glorot, Yoshua Bengio: Understanding the difficulty of training deep feedforward neural networks
Visualizing Optimization Algos

The following image is from Alec Radfords nice gifs. It does not contain ADAM, but you get a feeling for how much better one can do than SGD:

Two idea how this might be improved

try dropout
try not to use x values close to 0. I would rather sample values in [0.01, 1].

However, my experience with regression problems is quite limited.

score 0 · Answer 2 · answered Jan 10 '17 at 08:12

First of all, your input data is in range [0, 1), which is not a good input to a neural network. Subtract mean from x after computing y to make it normalized (also ideally divide by standard deviation).

However, in your particular case it was not enough to make it work.

I played with it and found two ways to make it work (both require data normalization as described above):

Either completely remove the second layer

or

Make the number of neurons in the second layer 50.

My guess would be that 10 neurons do not have sufficient representation power to pass enough information to the last layer (obviously, a perfectly smart NN would learn to ignore the second layer in this case passing the answer in one of the neurons, but the theoretical possibility doesn't mean that gradient descent will learn to do so).

Cuadra · Answer 3 · 2017-01-10T04:22:07.540

I have not look at the code but this is the theory. If you use an activation function like "tanh", then for small weights the activation function is in the linear region and for large weights the activation function is either -1 or +1. If you are in the linear region across all layers then you can not approximate complex functions (i.e. you have a sandwich of linear layers hence the best you can do is linear aproximations) but if you have bigger weights then the nonlinearly allow you to approximate a wide range of functions. There are no free lunches, the weights need to be at the right values to avoid over-fitting and under-fitting. This process is called regularization.

Why deep NN can't approximate simple ln(x) function?

3 Answers3

What your network has to predict

What your architecture is

Thoughts

Linked