10

I am trying to use Keras in order to set up a neural network, which is supposed to approximate an unknown function F(x). The resulting neural-network approximation for F(x) should then be used in order to find the minimum of the sum G(x) + F(x), where G(x) is some arbitrary function. The problem I am currently facing is, that the neural-network approximation for F(x) is not smooth enough and consequently local optimizer get stuck in tiny local Minima. I would be very thankful for any ideas in order to improve my results or to solve this problem.

Simple example: Minimizing the variance

Let me illustrate the problem on a very simplified example: I will try to teach a neural network the function F(x) = 4*var(x) with x = [x1,...,xN] and 0 <= xi <= 1, where var(x) means the variance of the vector x. Subsequently I will try to find the minimum of the neural-network representation of F(x) under the constraint, that x has a given mean value. The complete code for this example can be found in section. 3 below.

1. Creation and training of neural network

First I create a neural network for the approximation of F(x):

N = 6  # Dimension of input vector x

# Set up the neural network
model = Sequential()
model.add(Dense(50, activation='sigmoid', input_dim=N))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='mse')

Subsequently I generate training data and train the model:

# Generate training data
n_train = 100000  # Number of training samples
X_train = random((n_train, N))
y_train = 4*X_train.var(axis=1)

# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)

After the training is completed I test the accuracy of the model. To this aim I use test(100) (for the definition of the test function see the full code in sec. 3 below) and for my particular model I get an average error of 0.00517 which seems to be a pretty good result.

2. Minimizing F(x)

Having the neural-network approximation for F(x) I want to find its minimum under the constraint that x has a given average value. To this aim I will try the local optimizer minimize as well as the global optimizer differential_evolution from scipy.optimize.

Local optimization

I try to minimize F(x) under the constraint mean(x) = 0.5. Clearly the exact result F_min = 0, i.e., the minimum of the variance under the given constraint, would be obtained for the uniform distribution x = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]. I purposely choose a bad starting vector x0 in order to check if the optimizer can find its way to the minimum:

# Constraint
avg = 0.5  # Given average value of x
cons = {'type': 'eq', 'fun': lambda x: x.mean()-avg}

# Start vector
x0 = avg * np.array([2, 2, 2, 0, 0, 0])

# Minimization of neural-network representation
res_ML = minimize(lambda x: model.predict([[x]]), x0,
                  bounds=N*[(0, 1)], constraints=cons)

# Minimization of the exact function
res_ex = minimize(lambda x: 4*x.var(), x0,
                  bounds=N*[(0, 1)], constraints=cons)

The results for my model are as follows:

>>> res_ML.success
True

>>> res_ML.x
array([1., 1., 1., 0., 0., 0.])

>>> res_ex.x
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

Using the neural-network representation for F(x), the optimizer got immediately stuck. Using the exact function F(x) = 4*var(x) the optimizer found the correct result.

Global optimization

Instead of a local optimizer I was thinking to try a global optimizer. First I have tried to use shgo from scipy.optimize since it supports constraints, however it seems to be unable to find the minimum of F(x) even if the exact function is used (more details about this problem can be found here). Therefore I tried differential_evolution. Since differential_evolution does not support constraints, I enforce the condition mean(x) = 0.5 by using a penalty function:

# Minimization of neural-network representation
res2_ML = differential_evolution(lambda x: model.predict([[x]]) +
                                 1e3*(np.mean(x)-avg)**2, bounds=N*[(0, 1)])

# Minimization of the exact function
res2_ex = differential_evolution(lambda x: 4*x.var() + 1e3*(np.mean(x)-avg)**2,
                                 bounds=N*[(0, 1)])

The results I obtain are as follows:

>>> res2_ML.success
True

>>> res2_ML.x
array([0.50276561, 0.49869386, 0.49310187, 0.49895304, 0.4987404 ,
       0.50770651])

>>> res2_ex.x
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

>>> [float(model.predict([[x]])) for x in (res2_ML.x, res2_ex.x)]
[0.05173008516430855, 0.05170735716819763]

The result obtained by using the neural-network approximation for F(x) is already better than in the case of local optimization, but it is still not optimal. The issue is not that the model actually predicts the minimum to occur at the point res2_ML.x as can be seen from the last line, since the prediction of the model for the correct vector res2_ex.x is in fact lower. I have also tried to use tol=1e-12 in the call to differential_evolution, in order to increase the accuracy of the result, but without any noticeable improvement.

3. The complete Code

import numpy as np
from numpy.random import random
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from scipy.optimize import minimize, differential_evolution

# Parameters
N = 6             # Lenght of input vectors
n_train = 100000  # Number of training samples

# Generate training data
X_train = random((n_train, N))
y_train = 4*X_train.var(axis=1)

# Set up and train the neural network
model = Sequential()
model.add(Dense(50, activation='sigmoid', input_dim=N))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)


# ############################ Local Optimization #############################


# Constraint
avg = 0.5  # Given average value of x
cons = {'type': 'eq', 'fun': lambda x: x.mean()-avg}

# Start vector
x0 = avg * np.array([2, 2, 2, 0, 0, 0])

# Minimization of neural-network representation
res_ML = minimize(lambda x: model.predict([[x]]), x0,
                  bounds=N*[(0, 1)], constraints=cons)

# Minimization of the exact function
res_ex = minimize(lambda x: 4*x.var(), x0,
                  bounds=N*[(0, 1)], constraints=cons)


# ########################### Global Optimization #############################


# Minimization of neural-network representation using differential_evolution
res2_ML = differential_evolution(lambda x: model.predict([[x]]) +
                                 1e3*(np.mean(x)-avg)**2, bounds=N*[(0, 1)],
                                 tol=1e-6)

# Minimization of neural-network representation using shgo
res3_ML = shgo(lambda x: model.predict([[x]]), bounds=N*[(0, 1)],
               constraints=cons, sampling_method='sobol')

# Minimization of the exact function
res2_ex = differential_evolution(lambda x: 4*x.var() + 1e3*(np.mean(x)-avg)**2,
                                 bounds=N*[(0, 1)])


# ############################# Helper Function ###############################


def test(n_test):
    '''
    Function for testing the model.
    '''
    x = random((n_test, N))             # Test data
    pred = model.predict([x])           # Model prediction
    exct = 4*x.var(axis=1)              # Exact values
    diff = np.abs(pred.flatten()-exct)  # Difference
    # Print the test results to screen
    print('\npred.   | exact   | diff.')
    print('---------------------------')
    for k in range(n_test):
        print('%.5f | %.5f | %.5f' % (pred[k], exct[k], diff[k]))
    print('---------------------------')
    print('       avg. error | %.5f' % diff.mean())

Update

I made a mistake with the tol parameter for the differential_evolution method. Many thanks to Romeo Valentin for pointing this out. I have corrected this mistake in sec. 3. Using the tol parameter correctly definitely does improve the results of differential_evolution.

Furthermore, after posting an issue regarding the shgo optimizer on github, it turned out that the shgo optimizer works well, if the sobol sampling method is used:

# Minimization of neural-network representation using shgo
res3_ML = shgo(lambda x: model.predict([[x]]), bounds=N*[(0, 1)],
               constraints=cons, sampling_method='sobol')

Using this sampling method the result is perfect:

>>> res3_ML.success
True

>>> res3_ML.x
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

I have added the minimization using shgo to the full code in sec. 3.

I regard this problem as basically solved. However, I am still wondering, if my neural network is really the right choice for this kind of task, or if there are more advanced structures or activation functions, which could yield a smooth(er) function approximation.

Gulzar
  • 23,452
  • 27
  • 113
  • 201
TMueller83
  • 412
  • 3
  • 11
  • 2
    what data do you have for the training? how large is it? this is a very important thing to know before solving your optimization problem with NN. – smerllo Jul 09 '19 at 03:22
  • 1
    Thank you for your comment Cs20. The training data is described in section 1., where I generate 100,000 training samples. The data format as well as the bounds and constraints are the same in my real-world problem, just that the input dimension, i.e., the length of the vector `x` could be larger. There are smaller and larger real-world problems. For the small ones I can also obtain 100,000 training samples, for the larger ones probably less. – TMueller83 Jul 09 '19 at 09:25
  • Is your point that `res2_ex.x` results in a slightly lower value than `res2_ML.x`, so you wonder why you didn't get that point instead? – Romeo Valentin Jul 10 '19 at 14:36
  • 2
    Assuming that is indeed your problem, reducing the tolerance seems like a good idea. In the code provided in Section 3 you put the tolerance in the wrong line though - it should instead go to the minimization for `res2_ML` if I'm not mistaken. – Romeo Valentin Jul 10 '19 at 14:46
  • Also what happens when you run the `shgo` algorithm with slightly different bounds as suggested in the linked SO thread? – Romeo Valentin Jul 10 '19 at 15:02
  • After running the code you provided, but putting the `tol` option in the right line, the ML prediction is indeed lower than the exact prediction (when evaluated on the model) – Romeo Valentin Jul 10 '19 at 15:11
  • Hi Romeo. Thank you very much for your comments and for pointing out my mistake with the `tol` parameter. That was a dumb mistake from me :D I have updated my post correcting this mistake and added some new findings regarding the `shgo` method. – TMueller83 Jul 10 '19 at 20:26
  • In regards on whether a NN is the right kind of task: What do you want to achieve? The optimization procedure can be executed on any function, i.e. also the original one. The NN has the advantage that once set up, evaluation is very cheap, whereas the original model might be expensive to evaluate. On the other hand, you need to generate your training labels. The question becomes if you are willing to exchange rapid online evaluation (for example for optimization) for very slow offline label generation. – Romeo Valentin Jul 11 '19 at 08:19
  • Hi Romeo. That's exactly what I want to achieve: Having a representation of `F(x)` that can be used for quick optimization. Using the global optimizer correctly this works well now. However, before asking this question I came across an approach using radial basis functions (RBF) as neuron activation functions, which (if I well understood) should result in a smooth function approximation. Since to my knowledge Keras per default does not support RBFs I didn't implement this yet. Therefore I asked if there are maybe other, more appropriate techniques for smooth function approximation. – TMueller83 Jul 11 '19 at 09:41
  • Interesting idea. I haven't looked into RBF myself. But why do you not just optimize on the original function? You probably need less function evaluations that way than by generating enough labels for the NN approximation. – Romeo Valentin Jul 13 '19 at 12:07
  • Hi Romeo. Thank you for your comment. The problem with my real world function `F(x)` is, that it is impossible to compute it right away. I can only compute matching pairs `(F,x)` as output of a program. Changing the parameters in the program yields different pairs `(F,x)`, that is how I generate my training data. However, one can prove, that there is in fact a functional relation `F(x)` which I am trying to reconstruct using the neural network. Therefore I can not optimize the original function `F(x)`. – TMueller83 Jul 15 '19 at 12:17
  • @TMueller83, If I understand your problem correctly, I am wondering why you didn't incorporate any of this optimization function with your mse metric and use it in model.compile ?. Also, the results of any of functions don't update weights in NN ? – Yasmin Aug 05 '19 at 09:26
  • @Yasmin thank you for your comment. Unfortunately I do not really understand what you mean by incorporating the optimization functions in my mse metric. Could you please explain a bit more, what you mean? – TMueller83 Aug 07 '19 at 13:15
  • @TMueller83, sorry if I wasn't clear, How do you use or update your network using the output of res_ML ?. I mean applying scipy.optimize.shgo on the prediction of the model (model.predict) doesn't mean the model itself is changing based on the output of the optimizer ?. I am hoping to optimize a custom loss function using scipy optimizer – Yasmin Aug 13 '19 at 11:49
  • 1
    Hi @Yasmin, I do not update my model at all after the calculation of `res_ML`. The model is updated only during the training phase and afterwards it does not change anymore. The optimization using `shgo` and other optimizer is done in order to figure out, how well one can perform local or global optimization tasks on target functions that involve neural networks (in my simple example the target function is just a neural network). It turns out, that local optimization probably does not yield the desired results. Global optimization can however work pretty well. – TMueller83 Aug 13 '19 at 23:19
  • @TMueller83, many Thanks, would you be able to give your insights on that question, please. I understand that you don't update your model, but I am interested to check if x and y are between boundries, where x+y is my loss function.. https://stackoverflow.com/questions/57356381/how-to-use-scipy-optimize-minimize-within-keras-model-compile ? – Yasmin Aug 14 '19 at 17:52

0 Answers0