
After spending days failing to use a neural network for Q-learning, I decided to go back to basics and do a simple function approximation to check that everything was working correctly and to see how some parameters affect the learning process. Here is the code that I came up with:

from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import random
import numpy
from sklearn.preprocessing import MinMaxScaler
#from sklearn.ensemble import ExtraTreesRegressor  # needed for the commented-out sanity check below

regressor = Sequential()
regressor.add(Dense(units=20, activation='sigmoid', kernel_initializer='uniform', input_dim=1))
regressor.add(Dense(units=20, activation='sigmoid', kernel_initializer='uniform'))
regressor.add(Dense(units=20, activation='sigmoid', kernel_initializer='uniform'))
regressor.add(Dense(units=1))
regressor.compile(loss='mean_squared_error', optimizer='sgd')
#regressor = ExtraTreesRegressor()

N = 5000
X = numpy.empty((N,))
Y = numpy.empty((N,))

for i in range(N):
    X[i] = random.uniform(-10, 10)
X = numpy.sort(X).reshape(-1, 1)

for i in range(N):
    Y[i] = numpy.sin(X[i])
Y = Y.reshape(-1, 1)

X_scaler = MinMaxScaler()
Y_scaler = MinMaxScaler()
X = X_scaler.fit_transform(X)
Y = Y_scaler.fit_transform(Y)

regressor.fit(X, Y, epochs=2, verbose=1, batch_size=32)
#regressor.fit(X, Y.reshape(5000,))

x = numpy.mgrid[-10:10:100*1j]
x = x.reshape(-1, 1)
y = numpy.mgrid[-10:10:100*1j]
y = y.reshape(-1, 1)
x = X_scaler.fit_transform(x)

for i in range(len(x)):
    y[i] = regressor.predict(numpy.array([x[i]]))

plt.figure()
plt.plot(X_scaler.inverse_transform(x), Y_scaler.inverse_transform(y))
plt.plot(X_scaler.inverse_transform(X), Y_scaler.inverse_transform(Y))

The problem is that all my predictions are around 0 in value. As you can see, I used an ExtraTreesRegressor from sklearn (the commented lines) to check that the overall protocol is correct. So what is wrong with my neural network? Why is it not working?

(The actual problem that I'm trying to solve is to compute the Q function for the mountain car problem using a neural network. How is it different from this function approximator?)

  • Not sure what the learning rate here is, but does it get better if you train for more epochs? – bantmen Mar 31 '18 at 02:05
  • I can't see any difference when I set it to 100 – user3548298 Mar 31 '18 at 02:06
  • Using the sigmoid activation will lead to vanishing gradient problems; just don't use it. – Dr. Snoopy Mar 31 '18 at 10:49
  • I managed to get an approximation using a wider network and reducing the batch size while increasing the epochs, with the sigmoid. I just tried the same configuration but with relu and it doesn't work at all. What do you suggest for the activation? – user3548298 Mar 31 '18 at 11:38

3 Answers


With these changes:

  • Activations to relu
  • Remove kernel_initializer (i.e. leave the default 'glorot_uniform')
  • Adam optimizer
  • 100 epochs

i.e.

regressor = Sequential()
regressor.add(Dense(units=20, activation='relu', input_dim=1)) 
regressor.add(Dense(units=20, activation='relu')) 
regressor.add(Dense(units=20, activation='relu')) 
regressor.add(Dense(units=1))
regressor.compile(loss='mean_squared_error', optimizer='adam')

regressor.fit(X, Y, epochs=100, verbose=1, batch_size=32)

and the rest of your code unchanged, here is the result:

[plot of the resulting approximation: the predicted curve plotted against the sin(x) data]

Tinker, again and again...
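
To make the "tinker" part concrete, here is a minimal sketch of that kind of experimentation: it loops over a few candidate activations and optimizers and keeps the combination with the lowest final training loss. The candidate lists and the epoch count are just illustrative choices, and X, Y are assumed to be the scaled arrays from your question:

from keras.models import Sequential
from keras.layers import Dense

best_config, best_loss = None, float('inf')
for activation in ['relu', 'tanh', 'sigmoid']:     # candidate activations to try
    for optimizer in ['adam', 'sgd']:              # candidate optimizers to try
        model = Sequential()
        model.add(Dense(units=20, activation=activation, input_dim=1))
        model.add(Dense(units=20, activation=activation))
        model.add(Dense(units=1))
        model.compile(loss='mean_squared_error', optimizer=optimizer)
        history = model.fit(X, Y, epochs=20, batch_size=32, verbose=0)
        final_loss = history.history['loss'][-1]   # training loss of the last epoch
        if final_loss < best_loss:
            best_config, best_loss = (activation, optimizer), final_loss

print(best_config, best_loss)

In practice you would also vary the width, depth, batch size and learning rate, and compare configurations on held-out data rather than on the training loss.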

  • Could you expand on the "tinker" thing in a few words? It's not the first time I've seen it, but I don't really understand what it is and why it is used – user3548298 Mar 31 '18 at 12:09
  • @user3548298 It's just a (very) general expression, meaning "experiment, change, try & retry different things" etc., and don't feel bound by the book. Supposedly all (great) inventors are "tinkerers"... ;) – desertnaut Mar 31 '18 at 12:18
  • So does it mean that there is no specific way to model a given problem and I have to brute-force it? Also, I just tried your changes for a few runs and the results are very different from run to run. Are you sure you provided the right screenshot for it? – user3548298 Mar 31 '18 at 12:21
  • 1) no, no brute force - the idea of tinkering is trial & experiment 2) screenshot is from my 1st (and only) attempt with the shown code – desertnaut Mar 31 '18 at 12:29
  • @user3548298 have a look at this tweet & linked post about tinkering https://twitter.com/hardmaru/status/980935074416803840 – desertnaut Apr 03 '18 at 09:05

A more concise version of your code that works:

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

def data_gen():
    # endlessly yield random batches of (x, sin(x)) pairs
    while True:
        x = (np.random.random([1024])-0.5) * 10
        y = np.sin(x)
        yield (x,y)

regressor = Sequential()
regressor.add(Dense(units=20, activation='tanh', input_dim=1))
regressor.add(Dense(units=20, activation='tanh'))
regressor.add(Dense(units=20, activation='tanh'))
regressor.add(Dense(units=1, activation='linear'))
regressor.compile(loss='mse', optimizer='adam')

regressor.fit_generator(data_gen(), epochs=3, steps_per_epoch=128)

x = (np.random.random([1024])-0.5)*10
x = np.sort(x)
y = np.sin(x)

plt.plot(x, y)
plt.plot(x, regressor.predict(x))
plt.show()

Changes made: replacing the lower-layer activations with hyperbolic tangents, replacing the static dataset with a random generator, and replacing sgd with adam. That said, there are still problems with other parts of your code that I haven't been able to pin down yet (most likely your scaler and random process).
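
On the scaler point, one thing worth double-checking is that MinMaxScaler is fitted only once, on the training data, and then merely applied to the evaluation grid; calling fit_transform again on the grid quietly redefines the scaling. A minimal sketch of the intended usage (the variable names here are just illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.random.uniform(-10, 10, size=(5000, 1))
X_scaler = MinMaxScaler()
X_train_scaled = X_scaler.fit_transform(X_train)   # fit once, on the training data

x_grid = np.linspace(-10, 10, 100).reshape(-1, 1)  # evaluation grid
x_grid_scaled = X_scaler.transform(x_grid)         # apply only, no re-fitting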

  • The only problem with my code is the network (architecture and training). The rest is fine; it's just a small test script – user3548298 Mar 31 '18 at 11:43
  • I have been toying around with your code, and making the architecture changes on the original still results in a linear y_pred. I can't figure out what's wrong specifically, though. – KonstantinosKokos Mar 31 '18 at 11:46

I managed to get a good approximation by changing the architecture and the training as in the following code. It's a bit of overkill, but at least I know where the problem was coming from.

from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import random
import numpy
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesRegressor
from keras import optimizers

regressor = Sequential()
regressor.add(Dense(units=500, activation='sigmoid', kernel_initializer='uniform', input_dim=1))
regressor.add(Dense(units=500, activation='sigmoid', kernel_initializer='uniform'))
regressor.add(Dense(units=1, activation='sigmoid'))
regressor.compile(loss='mean_squared_error', optimizer='adam')
#regressor = ExtraTreesRegressor()

N = 5000

X = numpy.empty((N,))
Y = numpy.empty((N,))

for i in range(N):
    X[i] = random.uniform(-10, 10)

X = numpy.sort(X).reshape(-1, 1)

for i in range(N):
    Y[i] = numpy.sin(X[i])

Y = Y.reshape(-1, 1)

X_scaler = MinMaxScaler()
Y_scaler = MinMaxScaler()
X = X_scaler.fit_transform(X)
Y = Y_scaler.fit_transform(Y)

regressor.fit(X, Y, epochs=50, verbose=1, batch_size=2)
#regressor.fit(X, Y.reshape(5000,))

x = numpy.mgrid[-10:10:100*1j]
x = x.reshape(-1, 1)
y = numpy.mgrid[-10:10:100*1j]
y = y.reshape(-1, 1)
x = X_scaler.transform(x)  # reuse the scaler fitted on the training data, don't re-fit
for i in range(len(x)):
    y[i] = regressor.predict(numpy.array([x[i]]))


plt.figure()
plt.plot(X_scaler.inverse_transform(x), Y_scaler.inverse_transform(y))
plt.plot(X_scaler.inverse_transform(X), Y_scaler.inverse_transform(Y))

However, I'm still baffled: I found papers saying that they used only two hidden layers of five neurons to approximate the Q function of the mountain car problem, trained their network for only a few minutes, and got good results. I will try changing the batch size in my original problem to see what results I can get, but I'm not very optimistic.
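
For reference, the kind of network those papers describe would look roughly like the sketch below in Keras. The tanh activations, the adam optimizer and the assumption of three discrete actions are my own guesses, not taken from any specific paper:

from keras.models import Sequential
from keras.layers import Dense

# Hypothetical mountain car Q-network: the state is (position, velocity)
# and the output is one Q-value per discrete action (push left, no push, push right).
q_net = Sequential()
q_net.add(Dense(units=5, activation='tanh', input_dim=2))
q_net.add(Dense(units=5, activation='tanh'))
q_net.add(Dense(units=3, activation='linear'))
q_net.compile(loss='mean_squared_error', optimizer='adam')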
