
While trying to implement Episodic Semi-gradient Sarsa with a neural network as the approximator, I wondered how I choose the optimal action based on the currently learned weights of the network. If the action space is discrete, I can just calculate the estimated value of the different actions in the current state and choose the one which gives the maximum. But this does not seem like the best way of solving the problem. Furthermore, it does not work if the action space is continuous (like the acceleration of a self-driving car, for example).

So, basically I am wondering how to solve the 10th line, "Choose A' as a function of q̂(S', ·, w)", in this pseudo-code of Sutton: [image: pseudocode of Episodic Semi-gradient Sarsa from Sutton & Barto]

How are these problems typically solved? Can anyone recommend a good example of this algorithm implemented with Keras?

Edit: Do I need to modify the pseudo-code when using a network as the approximator, so that I simply minimize the MSE between the prediction of the network and the reward R, for example?

zimmerrol

1 Answer


I wondered how I choose the optimal action based on the currently learned weights of the network

You have three basic choices:

  1. Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the action with the maximum estimated value as the predicted optimum action (with probability 1-ε; otherwise choose randomly, for the ε-greedy policy typically used in SARSA). A minimal sketch of this option is shown after this list.

  2. Design the network to estimate all action values at once - i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This alters the gradient calculations slightly: there should be zero gradient applied to the inactive outputs of the last layer (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times, and it is also the approach used by the DQN Atari game-playing agent and AlphaGo's policy networks.

  3. Use a policy-gradient method, which works by using samples to estimate the gradient that would improve a policy estimator. See chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive when there are large numbers of possible actions, and they can cope with continuous action spaces (by estimating a distribution for the optimal policy - e.g. predicting the mean and standard deviation of a normal distribution, which you then sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
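
As a rough illustration of option 1, here is a minimal sketch in Keras. The network architecture, the `n_state_dims`/`n_actions` values and the one-hot action encoding are illustrative assumptions, not taken from any particular implementation:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical q(s, a) estimator: the input is a state vector concatenated
# with a one-hot action encoding, the output is a single estimated action value.
n_state_dims = 2   # e.g. position and velocity in MountainCar
n_actions = 3      # e.g. push left, no push, push right

q_model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 input_shape=(n_state_dims + n_actions,)),
    layers.Dense(1, activation="linear"),  # linear output for a regression target
])

def epsilon_greedy_action(state, epsilon=0.1):
    """Option 1: one forward pass per candidate action, then take the argmax."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # Build a mini-batch with one row per candidate action, all for the same state
    actions_one_hot = np.eye(n_actions)
    states = np.repeat(np.asarray(state, dtype=float).reshape(1, -1), n_actions, axis=0)
    q_values = q_model.predict(np.hstack([states, actions_one_hot]), verbose=0)
    return int(np.argmax(q_values))
```

Option 2 would instead use a single network with `n_actions` outputs, so that greedy selection is just one forward pass followed by `np.argmax` over the valid outputs.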

Note that if your action space is continuous, you don't have to use a policy-gradient method; you could just quantise the action. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy involves only using extreme values (the classic mountain car example falls into this category: the only useful actions are maximum forward acceleration and maximum backward acceleration).
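
For example, an acceleration in a fixed range could be quantised into a handful of candidate values and then treated as a discrete action set; the range and number of bins below are illustrative assumptions:

```python
import numpy as np

# Quantise a continuous action range (e.g. acceleration in [-1, 1]) into
# discrete candidates; these can then be scored with any of the
# discrete-action approaches above and the argmax taken.
candidate_accelerations = np.linspace(-1.0, 1.0, num=9)
```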

Do I need to modify the pseudo-code when using a network as the approximator, so that I simply minimize the MSE between the prediction of the network and the reward R, for example?

No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The update is w ← w + α[R + γq̂(S',A',w) - q̂(S,A,w)]∇q̂(S,A,w); the error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally, the term ∇q̂(S,A,w) means the gradient of the estimator itself - not the gradient of any loss function.
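
That said, if you want to drive the update with a standard Keras optimiser (as discussed in the comments below), you can get the same effect by treating the bootstrapped TD target R + γq̂(S',A',w) as a fixed regression label under MSE loss: with the target y held constant, the gradient of ½(y - q̂(S,A,w))² is -(y - q̂(S,A,w))∇q̂(S,A,w), i.e. the bracketed error times the gradient of the estimator. A minimal sketch, reusing the hypothetical `q_model`, `n_actions` and one-hot encoding from the earlier snippet:

```python
# Compile with MSE so that a gradient step on a single (S, A) example applies
# (q(S,A,w) - target) * grad q(S,A,w), matching the semi-gradient Sarsa step
# up to the learning-rate scaling.
q_model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

def sarsa_update(state, action, reward, next_state, next_action,
                 gamma=1.0, done=False):
    """One episodic semi-gradient Sarsa step, approximated via MSE on the TD target.

    The TD target is computed outside the training call and passed in as a
    fixed label, so the gradient only flows through q(S, A, w).
    """
    actions_one_hot = np.eye(n_actions)
    sa = np.hstack([np.asarray(state, dtype=float),
                    actions_one_hot[action]]).reshape(1, -1)
    if done:
        td_target = np.array([[float(reward)]])
    else:
        next_sa = np.hstack([np.asarray(next_state, dtype=float),
                             actions_one_hot[next_action]]).reshape(1, -1)
        td_target = reward + gamma * q_model.predict(next_sa, verbose=0)
    q_model.train_on_batch(sa, td_target)
```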

Neil Slater
  • Okay, thanks. I decided to go with option 1 of your answer and tried to implement it using `Keras`. But I still do not really know how to perform the update of the weights. I tried to solve it like this https://gist.github.com/FlashTek/0dfddf46c4d50c4e068f1ecbad1d03b5, but sadly the agent is not really learning anything. Can you give me more details about this step? – zimmerrol Jul 29 '17 at 18:36
  • I'm not that familiar with how the merge method works in Keras, but I assume you have ended up creating a mini-batch of 3 to predict when selecting an action, and then a mini-batch of 1 for the update. I see a problem: `output_layer = Dense(1, activation="tanh")(x)` - I don't think `tanh` covers the full range of rewards; with this task, a reward of e.g. -300 is possible. You appear to have modified the training target (using the TD target, not the TD error), but I think that is a correct modification to go with using the Keras optimiser in a simple way. – Neil Slater Jul 29 '17 at 19:24
  • Yes, that's what I wanted to do. Okay - which other activation function should I use instead of this? I thought that a `LeakyReLu` would solve the problem (as it covers all negative values), but the agent is still not learning anything. – zimmerrol Jul 29 '17 at 19:44
  • @FlashTek: I think you were close other than the tanh problem; see gist.github.com/neilslater/28004397a544f97b2ff03d25d4ddae52, which I have working (it's a regression problem, so usually you want the last layer to be linear). I have simplified the model a little, because it is possible it was unstable, and also added the "backstop" of setting velocity to zero, which is normal in MountainCar. – Neil Slater Jul 29 '17 at 20:08
  • @FlashTek: I spoke too soon - the implementation I have posted seems unstable - it gets some initial episodes of length ~450, but then starts to diverge. I am not sure why. – Neil Slater Jul 29 '17 at 20:22
  • Your modification does not seem to solve the problem: sometimes the agent solves it quickly, but sometimes it needs an almost infinite time to get to the top of the mountain. – zimmerrol Jul 29 '17 at 20:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150499/discussion-between-flashtek-and-neil-slater). – zimmerrol Jul 29 '17 at 21:00
  • @FlashTek: I tried adding experience replay to your code, and this makes a significant improvement. It is still not as fast as the linear tiled features and semi-gradient SARSA(λ) from Sutton's book, but it does seem to converge to a successful policy. I've updated the gist: https://gist.github.com/neilslater/28004397a544f97b2ff03d25d4ddae52 - probably it could be coded more efficiently in terms of array manipulation... – Neil Slater Jul 30 '17 at 21:34
  • _"The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect."_. Using this as Loss does indeed make the algorithm work, but I do not understand why. Can you elaborate a bit on this? Where can I find relevant literature demonstrating that minimizing the TD error (δ) will lead to the same as maximizing Obj. function given: ∇Obj. = δ∇Q(s,a) Thank you. @NeilSlater – Alex Ramalho Jul 19 '22 at 11:19
  • @AlexandreRamalho The simplest way to view it is to think of it as the MSE against the true action values (of the current policy), although in regular SARSA it is more complex than that, because it is not a full gradient but a *semi-gradient*, due to ignoring the effect of bootstrapping. A good reference is Sutton & Barto - http://incompleteideas.net/sutton/book/the-book-2nd.html - already linked in the answer. The section on loss functions for TD learning is chapter 11, but be warned, it is complex. – Neil Slater Jul 19 '22 at 11:29
  • Yeah man I'm trying to make sense of that as of now. Thanks for the very quick answer, really. – Alex Ramalho Jul 19 '22 at 15:28