
I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case below?"

Details: I want to implement a Neural Net which should learn to play a simple board game using Policy Gradient Descent. I'll omit the details of the NN as they don't matter here. The loss function for Policy Gradient Descent, as I understand it, is the reward-weighted negative log-likelihood: loss = -avg(r * log(p))
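
To make the formula concrete, here is a rough sketch of how I compute that loss (plain NumPy; the function name and the numbers are just illustrative):

```python
import numpy as np

def policy_gradient_loss(action_probs, rewards):
    """Negative average of reward-weighted log-probabilities: -avg(r * log(p)).

    action_probs -- probability the policy assigned to each action actually taken
    rewards      -- (discounted) reward credited to each of those actions
    """
    action_probs = np.asarray(action_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return -np.mean(rewards * np.log(action_probs))

# Three moves from a game that was won (reward 1 for every move, no discounting):
print(policy_gradient_loss([0.7, 0.4, 0.9], [1.0, 1.0, 1.0]))  # roughly 0.459
```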

My question now is how to define the reward r. Since the game can have three different outcomes (win, loss, or draw), it seems that rewarding 1 for a win, 0 for a draw, and -1 for a loss (and some discounted value of those for the actions leading to those outcomes) would be a natural choice.
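
By "some discounted value of those" I mean something along these lines (again just a sketch; the discount factor is an arbitrary choice on my part):

```python
import numpy as np

def discounted_rewards(final_reward, num_moves, gamma=0.9):
    """Spread the final outcome back over the moves of the game.

    final_reward -- +1 win, 0 draw, -1 loss
    gamma        -- discount factor; 0.9 is just a placeholder value
    The final move receives the full reward, earlier moves receive gamma^k of it.
    """
    return np.array([final_reward * gamma ** (num_moves - 1 - t)
                     for t in range(num_moves)])

print(discounted_rewards(+1, 4))  # roughly [0.729, 0.81, 0.9, 1.0]
print(discounted_rewards(0, 4))   # all zeros: a draw credits nothing
print(discounted_rewards(-1, 4))  # roughly [-0.729, -0.81, -0.9, -1.0]
```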

However, mathematically I have doubts:

Win Reward: 1 - This seems to make sense. It should push probabilities towards 1 for moves involved in wins, with a diminishing gradient the closer the probability gets to 1.

Draw Reward: 0 - This does not seem to make sense. It would just zero out the corresponding log-probability terms in the equation, so no learning should be possible (as the gradient would always be 0).

Loss Reward: -1 - This should kind of work. It should push probabilities towards 0 for moves involved in losses. However, I'm concerned about the asymmetry of the gradient compared to the win case: the closer the probability gets to 0, the steeper the gradient becomes. I worry that this would create an extremely strong bias towards a policy that avoids losses, to the degree where the win signal hardly matters at all.
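
To show what I mean numerically, here is a tiny check of the per-move gradient, differentiating -r * log(p) with respect to p directly (this ignores the network and its output layer, so it is only meant to illustrate the shape of the term):

```python
# Per-move loss term is -r * log(p); its gradient with respect to p is -r / p.
for r, label in [(1.0, "win "), (0.0, "draw"), (-1.0, "loss")]:
    for p in [0.9, 0.5, 0.1, 0.01]:
        grad = -r / p if r != 0 else 0.0   # avoid printing "-0.00" for the draw case
        print(f"{label} r={r:+.0f}  p={p:.2f}  d(-r*log p)/dp = {grad:+9.2f}")
```

The table shows that a draw contributes exactly zero gradient, the win term stays bounded near p = 1, and the loss term grows without bound as p approaches 0, which is exactly the asymmetry I'm worried about.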

Carsten

1 Answer


You are on the right track. However, I believe you are confusing rewards with action probabilities. In the case of a draw, the agent simply learns that the reward at the end of the episode is zero. In the case of a loss, the loss function is the discounted reward (which should be -1) times the log of the action probability. So it will pull you towards actions which end in a win and away from actions which end in a loss, with actions ending in a draw falling in the middle. Intuitively, it is very similar to supervised deep learning, only with an additional weighting factor (the reward) attached to each term.
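
To illustrate the analogy (a quick sketch, not tied to any particular framework): if you set every reward to 1, the policy-gradient loss is exactly the negative log-likelihood of supervised classification; other rewards just re-weight, or flip, each term.

```python
import numpy as np

# Supervised view: a per-example weighted negative log-likelihood,
# where the weight is the (discounted) reward.
def weighted_nll(probs_of_taken_actions, weights):
    p = np.asarray(probs_of_taken_actions, dtype=float)
    w = np.asarray(weights, dtype=float)
    return -np.mean(w * np.log(p))

probs = [0.6, 0.3]
print(weighted_nll(probs, [1, 1]))    # all weights 1: ordinary supervised NLL
print(weighted_nll(probs, [0, 0]))    # draw: the terms contribute nothing
print(weighted_nll(probs, [-1, -1]))  # loss: the same terms push the other way
```

With this view, a draw simply contributes no gradient, which is not a bug: those moves are neither encouraged nor discouraged.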

Additionally, I believe this paper from Google DeepMind would be useful for you: https://arxiv.org/abs/1712.01815. They actually talk about solving the chess problem using RL.

shunyo
  • Thanks for your reply. I glossed over a few bits in my question as it was already getting quite long. In practice I do use discounted rewards; however, my fundamental concern remains: a discounted reward of 0 is still always zero, which gets multiplied with the log-probabilities, effectively removing them from the equation. – Carsten Jul 01 '18 at 01:32
  • The win and loss rewards are still always positive and negative respectively, resulting in a very asymmetric gradient: the minimum of the positively weighted term is 0, approached with a diminishing gradient, while the minimum of the negatively weighted term is -inf, approached with an increasingly steep gradient. I'm not quite sure I understand your comparison with supervised learning. In SL, as I understand it, you minimize a metric which has a natural minimum at 0, whereas in this case you minimize a log term which, for a negative reward, can go all the way down to -inf. – Carsten Jul 01 '18 at 01:32
  • How can that happen? You are taking the log of the probabilities multiplied by the rewards. So the maximum is `log(1)*r` while the minimum is `log(0)*r`. So unless you are taking the rewards as `-inf`, there is no way to get it. Also, the loss is the average of `log(prob)*rewards`. – shunyo Jul 02 '18 at 17:52
  • These lecture notes explain the similarity between SL and policy gradient RL: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_4_policy_gradient.pdf – shunyo Jul 02 '18 at 17:53
  • Re how this can happen: log(0) is -infinity, unless the reward is negative, e.g. -1, in which case the product turns into +infinity. The minus in front of it all in the loss formula turns it around again, so the minimum for any negative reward is a probability of 0 with a "loss" of -infinity, no? For reward 0, log(p)*0 = 0 for any p, so there is nothing to learn. – Carsten Jul 03 '18 at 01:58
  • Yes, but an action probability of 0 (definitely not happening) or 1 (definitely happening) is impossible to get with a stochastic system IMO. I just wrote it down to explain the bounds. If you are still unsure, you can create a simple policy gradient network and run a simple RL problem. I would wager you would not get the infinity problem; at least I have not seen it in my work. – shunyo Jul 09 '18 at 20:42
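
To back that up with a quick check (a sketch; the softmax here just stands in for whatever output layer the network actually uses): with finite logits the probabilities never reach exactly 0 or 1, so log(p) stays finite, even though a very confident policy can still make individual terms large.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# With finite logits, softmax outputs stay strictly inside (0, 1), so log(p)
# of any action remains finite; the -inf case would need an exactly-zero probability.
for gap in [1.0, 10.0, 50.0]:
    p = softmax(np.array([gap, 0.0, 0.0]))
    print(f"logit gap {gap:5.1f}: min prob = {p.min():.3e}, log(min prob) = {np.log(p.min()):.2f}")
```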