
I'm trying to implement the deep Q-learning algorithm for a Pong game. I've already implemented Q-learning using a table as the Q-function. It works very well and learns to beat the naive AI within 10 minutes. But I can't make it work using a neural network as a Q-function approximator.

I want to know if I am on the right track, so here is a summary of what I am doing:

  • I'm storing the current state, the action taken, and the reward as the current Experience in the replay memory.
  • I'm using a multilayer perceptron as the Q-function, with 1 hidden layer of 512 hidden units. For the input -> hidden layer I am using a sigmoid activation function; for the hidden -> output layer I'm using a linear activation function.
  • A state is represented by the positions of both players and the ball, as well as the velocity of the ball. Positions are remapped to a much smaller state space.
  • I am using an epsilon-greedy approach for exploring the state space, where epsilon gradually goes down to 0.
  • When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target Q-values for all the current states and actions Q(s, a) (see the Java sketch after the pseudocode below):

    forall Experience e in batch
        if e == endOfEpisode
            target = e.getReward
        else
            target = e.getReward + discountFactor * qMaxPostState
        end
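
In Java, the target computation looks roughly like this. This is only a simplified sketch, not my exact code: the Experience here also stores the next state, since qMaxPostState is computed from it, and predict stands for a forward pass of the network that returns one Q-value per action.

    import java.util.List;
    import java.util.function.Function;

    // Simplified sketch of the per-batch target computation from the pseudocode above.
    // The field names and the predict function are placeholders, not my actual code.
    class Experience {
        double[] state;      // current state s
        int action;          // action a taken in s
        double reward;       // reward r received after taking a
        double[] nextState;  // state s' observed after the action
        boolean endOfEpisode;
    }

    class TargetComputation {
        // predict.apply(s) = forward pass of the network, returning one Q-value per action
        static double[] computeTargets(List<Experience> batch,
                                       double discountFactor,
                                       Function<double[], double[]> predict) {
            double[] targets = new double[batch.size()];
            for (int i = 0; i < batch.size(); i++) {
                Experience e = batch.get(i);
                if (e.endOfEpisode) {
                    targets[i] = e.reward;
                } else {
                    // qMaxPostState: highest predicted Q-value over all actions in s'
                    double qMaxPostState = Double.NEGATIVE_INFINITY;
                    for (double q : predict.apply(e.nextState)) {
                        qMaxPostState = Math.max(qMaxPostState, q);
                    }
                    targets[i] = e.reward + discountFactor * qMaxPostState;
                }
            }
            return targets;
        }
    }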

Now that I have a set of 32 target Q-values, I train the neural network on them using batch gradient descent. I am only doing 1 training step. How many should I do?
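
In Encog terms, the network setup and one training step look roughly like this. This is a simplified sketch of what I described, assuming Encog 3.x; the constructor signatures may differ slightly between versions.

    import org.encog.engine.network.activation.ActivationLinear;
    import org.encog.engine.network.activation.ActivationSigmoid;
    import org.encog.ml.data.MLDataSet;
    import org.encog.ml.data.basic.BasicMLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.layers.BasicLayer;
    import org.encog.neural.networks.training.propagation.back.Backpropagation;

    // Sketch (Encog 3.x): sigmoid hidden layer, linear output layer,
    // and a single backpropagation step on one 32-sample batch.
    class DqnTrainingSketch {
        static BasicNetwork buildNetwork(int inputSize, int numActions) {
            BasicNetwork network = new BasicNetwork();
            network.addLayer(new BasicLayer(null, true, inputSize));                     // input layer (+ bias)
            network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 512));        // hidden layer
            network.addLayer(new BasicLayer(new ActivationLinear(), false, numActions)); // linear output
            network.getStructure().finalizeStructure();
            network.reset(); // random initial weights
            return network;
        }

        static void trainOneStep(BasicNetwork network, double[][] states, double[][] targetQ,
                                 double learningRate) {
            // states: the 32 batch states, targetQ: the corresponding target Q-value vectors
            MLDataSet batch = new BasicMLDataSet(states, targetQ);
            Backpropagation train = new Backpropagation(network, batch, learningRate, 0.0); // momentum 0
            train.iteration();      // one batch gradient step; loop here to take more steps
            train.finishTraining();
        }
    }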

I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and the resulting performance is very weak. I think I am missing something, but I can't figure out what. I would expect at least a somewhat decent result, since the table approach has no problems.

SilverTear

2 Answers


I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units.

Might be too big. Depends on your input / output dimensionality and the problem. Did you try fewer?

Sanity checks

Can the network possibly learn the necessary function?

Collect ground truth input/output. Fit the network in a supervised way. Does it give the desired output?
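
For example, since your tabular agent already works, you could take (state, Q-values) pairs from the learned table as ground truth and fit the network to them. A rough Encog sketch (class names from Encog 3.x; "states" and "tableQValues" are arrays you would build from your Q-table):

    import org.encog.ml.data.MLDataSet;
    import org.encog.ml.data.basic.BasicMLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

    // Sanity check (sketch): fit the network to Q-values taken from the working Q-table.
    // states[i] is a state vector, tableQValues[i] the Q-table row for that state.
    class SupervisedSanityCheck {
        static double fit(BasicNetwork network, double[][] states, double[][] tableQValues) {
            MLDataSet data = new BasicMLDataSet(states, tableQValues);
            ResilientPropagation train = new ResilientPropagation(network, data);
            for (int epoch = 0; epoch < 500; epoch++) {
                train.iteration();
            }
            train.finishTraining();
            return train.getError(); // if this stays high, the architecture itself is the problem
        }
    }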

A common error is using the wrong activation function on the output layer. Most of the time you will want a linear output activation (as you have). You also want the network to be as small as possible, because RL is pretty unstable: you can have 99 runs where it doesn't work and 1 where it does.

Do I explore enough?

Check how much you explore. Maybe you need more exploration, especially in the beginning?
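
For example, something along these lines; the decay factor and the minimum epsilon are just illustrative values, and keeping a small floor instead of going all the way to 0 is one way to keep exploring:

    import java.util.Random;

    // Sketch: epsilon-greedy action selection with a decay schedule and a floor.
    // DECAY and MIN_EPSILON are illustrative values, not recommendations.
    class EpsilonGreedy {
        private static final double DECAY = 0.999;
        private static final double MIN_EPSILON = 0.05;
        private double epsilon = 1.0;
        private final Random rng = new Random();

        int selectAction(double[] qValues) {
            int action;
            if (rng.nextDouble() < epsilon) {
                action = rng.nextInt(qValues.length);        // explore: random action
            } else {
                action = 0;                                   // exploit: argmax over Q-values
                for (int a = 1; a < qValues.length; a++) {
                    if (qValues[a] > qValues[action]) {
                        action = a;
                    }
                }
            }
            epsilon = Math.max(MIN_EPSILON, epsilon * DECAY); // decay per step, keep a floor
            return action;
        }
    }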

Martin Thoma
  • I think I have tried several configurations, but I did not do a run with fewer than 100 hidden units. Perhaps that would be worth a try? Also, when it comes to working on these kinds of problems, how much does hardware matter? Are these kinds of smaller problems simple enough to be trained on a CPU, or do I need a high-end GPU to even do this particular problem? In this case I was training on an Intel i7, so could it just be that I was not training long enough, as it takes too long without a GPU? – SilverTear Jul 17 '18 at 11:14
  • Try using ReLU (or better, leaky ReLU) units in the hidden layer and a linear activation for the output.
  • Try changing the optimizer; sometimes SGD with proper learning-rate decay helps, sometimes Adam works fine.
  • Reduce the number of hidden units. It might just be too much.
  • Adjust the learning rate. The more units you have, the more impact the learning rate has, as the output is the weighted sum of all the neurons before it.
  • Try using the local position of the ball, meaning ballY - paddleY. This can help drastically, as it reduces the data to "above or below the paddle", distinguished by the sign. Remember: if you use the local position, you won't need the player's paddle position, and the opponent's paddle position must be local too.
  • Instead of the velocity, you can give it the previous state as an additional input. The network can calculate the difference between those 2 steps. (See the sketch after this list.)
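
A rough sketch of what the last two points could look like when building the input vector; the field names (ballX, ballY, playerPaddleY, enemyPaddleY) are placeholders for whatever your game state exposes:

    // Sketch: input features using local ball positions and the previous frame
    // instead of an explicit velocity. Field names are placeholders.
    class GameState {
        double ballX, ballY;
        double playerPaddleY;
        double enemyPaddleY;
    }

    class FeatureBuilder {
        static double[] buildInput(GameState current, GameState previous) {
            return new double[] {
                current.ballX,
                current.ballY - current.playerPaddleY,    // local position: the sign says above/below the paddle
                current.ballY - current.enemyPaddleY,     // the opponent's paddle made local as well
                previous.ballX,                           // previous frame instead of velocity:
                previous.ballY - previous.playerPaddleY,  // the network can work out the ball's movement
                previous.ballY - previous.enemyPaddleY    // from the difference between the two frames
            };
        }
    }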