
I'm trying to implement the deep Q-learning algorithm for a Pong game. I've already implemented Q-learning using a table as the Q-function. It works very well and learns to beat the naive AI within 10 minutes. But I can't make it work using a neural network as a Q-function approximator.

I want to know if I am on the right track, so here is a summary of what I am doing:

  • I'm storing the current state, the action taken, and the reward as the current Experience in the replay memory.
  • I'm using a multilayer perceptron as the Q-function, with 1 hidden layer of 512 hidden units. For the input -> hidden layer I am using a sigmoid activation function; for the hidden -> output layer I'm using a linear activation function.
  • A state is represented by the positions of both players and the ball, as well as the velocity of the ball. Positions are remapped to a much smaller state space.
  • I am using an epsilon-greedy approach for exploring the state space, where epsilon gradually goes down to 0.
  • When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target Q-values for all the current states and actions Q(s, a) (see the Java sketch after the pseudocode below):

    forall Experience e in batch
        if e == endOfEpisode
            target = e.getReward
        else
            target = e.getReward + discountFactor * qMaxPostState
        end
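
In Java, the target computation looks roughly like this. This is only a simplified sketch, not my exact code: the Experience here also stores the next state, since qMaxPostState is computed from it, and predict stands for a forward pass of the network that returns one Q-value per action.

    import java.util.List;
    import java.util.function.Function;

    // Simplified sketch of the per-batch target computation from the pseudocode above.
    // The field names and the predict function are placeholders, not my actual code.
    class Experience {
        double[] state;      // current state s
        int action;          // action a taken in s
        double reward;       // reward r received after taking a
        double[] nextState;  // state s' observed after the action
        boolean endOfEpisode;
    }

    class TargetComputation {
        // predict.apply(s) = forward pass of the network, returning one Q-value per action
        static double[] computeTargets(List<Experience> batch,
                                       double discountFactor,
                                       Function<double[], double[]> predict) {
            double[] targets = new double[batch.size()];
            for (int i = 0; i < batch.size(); i++) {
                Experience e = batch.get(i);
                if (e.endOfEpisode) {
                    targets[i] = e.reward;
                } else {
                    // qMaxPostState: highest predicted Q-value over all actions in s'
                    double qMaxPostState = Double.NEGATIVE_INFINITY;
                    for (double q : predict.apply(e.nextState)) {
                        qMaxPostState = Math.max(qMaxPostState, q);
                    }
                    targets[i] = e.reward + discountFactor * qMaxPostState;
                }
            }
            return targets;
        }
    }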

Now that I have a set of 32 target Q-values, I train the neural network on them using batch gradient descent. I am only doing 1 training step. How many should I do?
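
In Encog terms, the network setup and one training step look roughly like this. This is a simplified sketch of what I described, assuming Encog 3.x; the constructor signatures may differ slightly between versions.

    import org.encog.engine.network.activation.ActivationLinear;
    import org.encog.engine.network.activation.ActivationSigmoid;
    import org.encog.ml.data.MLDataSet;
    import org.encog.ml.data.basic.BasicMLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.layers.BasicLayer;
    import org.encog.neural.networks.training.propagation.back.Backpropagation;

    // Sketch (Encog 3.x): sigmoid hidden layer, linear output layer,
    // and a single backpropagation step on one 32-sample batch.
    class DqnTrainingSketch {
        static BasicNetwork buildNetwork(int inputSize, int numActions) {
            BasicNetwork network = new BasicNetwork();
            network.addLayer(new BasicLayer(null, true, inputSize));                     // input layer (+ bias)
            network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 512));        // hidden layer
            network.addLayer(new BasicLayer(new ActivationLinear(), false, numActions)); // linear output
            network.getStructure().finalizeStructure();
            network.reset(); // random initial weights
            return network;
        }

        static void trainOneStep(BasicNetwork network, double[][] states, double[][] targetQ,
                                 double learningRate) {
            // states: the 32 batch states, targetQ: the corresponding target Q-value vectors
            MLDataSet batch = new BasicMLDataSet(states, targetQ);
            Backpropagation train = new Backpropagation(network, batch, learningRate, 0.0); // momentum 0
            train.iteration();      // one batch gradient step; loop here to take more steps
            train.finishTraining();
        }
    }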

I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and the resulting performance is very weak. I think I am missing something, but I can't figure out what. I would expect at least a somewhat decent result, since the table approach has no problems.

SilverTear

2 Answers


I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units.

Might be too big. Depends on your input / output dimensionality and the problem. Did you try fewer?

Sanity checks

Can the network possibly learn the necessary function?

Collect ground truth input/output. Fit the network in a supervised way. Does it give the desired output?
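
For example, since your tabular agent already works, you could take (state, Q-values) pairs from the learned table as ground truth and fit the network to them. A rough Encog sketch (class names from Encog 3.x; "states" and "tableQValues" are arrays you would build from your Q-table):

    import org.encog.ml.data.MLDataSet;
    import org.encog.ml.data.basic.BasicMLDataSet;
    import org.encog.neural.networks.BasicNetwork;
    import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

    // Sanity check (sketch): fit the network to Q-values taken from the working Q-table.
    // states[i] is a state vector, tableQValues[i] the Q-table row for that state.
    class SupervisedSanityCheck {
        static double fit(BasicNetwork network, double[][] states, double[][] tableQValues) {
            MLDataSet data = new BasicMLDataSet(states, tableQValues);
            ResilientPropagation train = new ResilientPropagation(network, data);
            for (int epoch = 0; epoch < 500; epoch++) {
                train.iteration();
            }
            train.finishTraining();
            return train.getError(); // if this stays high, the architecture itself is the problem
        }
    }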

A common error is using the wrong activation function on the output layer. Most of the time you will want a linear output activation (as you have). You also want the network to be as small as possible, because RL is pretty unstable: you can have 99 runs where it doesn't work and 1 where it does.

Do I explore enough?

Check how much you explore. Maybe you need more exploration, especially in the beginning?
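
For example, something along these lines; the decay factor and the minimum epsilon are just illustrative values, and keeping a small floor instead of going all the way to 0 is one way to keep exploring:

    import java.util.Random;

    // Sketch: epsilon-greedy action selection with a decay schedule and a floor.
    // DECAY and MIN_EPSILON are illustrative values, not recommendations.
    class EpsilonGreedy {
        private static final double DECAY = 0.999;
        private static final double MIN_EPSILON = 0.05;
        private double epsilon = 1.0;
        private final Random rng = new Random();

        int selectAction(double[] qValues) {
            int action;
            if (rng.nextDouble() < epsilon) {
                action = rng.nextInt(qValues.length);        // explore: random action
            } else {
                action = 0;                                   // exploit: argmax over Q-values
                for (int a = 1; a < qValues.length; a++) {
                    if (qValues[a] > qValues[action]) {
                        action = a;
                    }
                }
            }
            epsilon = Math.max(MIN_EPSILON, epsilon * DECAY); // decay per step, keep a floor
            return action;
        }
    }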

Martin Thoma
  • I think I have tried several configurations, but I did not do a run with fewer than 100 hidden units. Perhaps that would be worth a try? Also, when it comes to working on these kinds of problems, how much does hardware matter? Are these kinds of smaller problems simple enough to be trained on a CPU, or do I need a high-end GPU to even do this particular problem? In this case I was training on an Intel i7, so could it just be that I was not training long enough, as it takes too long without a GPU? – SilverTear Jul 17 '18 at 11:14
  • Try using ReLU (or better, leaky ReLU) units in the hidden layer and a linear activation for the output.
  • Try changing the optimizer; sometimes SGD with proper learning-rate decay helps, sometimes Adam works fine.
  • Reduce the number of hidden units. It might just be too much.
  • Adjust the learning rate. The more units you have, the more impact the learning rate has, as the output is the weighted sum of all the neurons before it.
  • Try using the local position of the ball, meaning ballY - paddleY. This can help drastically, as it reduces the data to "above or below the paddle", distinguished by the sign. Remember: if you use the local position, you won't need the player's paddle position, and the opponent's paddle position must be local too.
  • Instead of the velocity, you can give it the previous state as an additional input. The network can calculate the difference between those 2 steps. (See the sketch after this list.)
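
A rough sketch of what the last two points could look like when building the input vector; the field names (ballX, ballY, playerPaddleY, enemyPaddleY) are placeholders for whatever your game state exposes:

    // Sketch: input features using local ball positions and the previous frame
    // instead of an explicit velocity. Field names are placeholders.
    class GameState {
        double ballX, ballY;
        double playerPaddleY;
        double enemyPaddleY;
    }

    class FeatureBuilder {
        static double[] buildInput(GameState current, GameState previous) {
            return new double[] {
                current.ballX,
                current.ballY - current.playerPaddleY,    // local position: the sign says above/below the paddle
                current.ballY - current.enemyPaddleY,     // the opponent's paddle made local as well
                previous.ballX,                           // previous frame instead of velocity:
                previous.ballY - previous.playerPaddleY,  // the network can work out the ball's movement
                previous.ballY - previous.enemyPaddleY    // from the difference between the two frames
            };
        }
    }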