I'm currently learning about policy gradient methods in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and in practice), and what would be a good reward function for the case below?"
Details:
I want to implement a neural net that learns to play a simple board game using policy gradients. I'll omit the details of the NN as they don't matter here. As I understand it, the loss function is the reward-weighted negative log-likelihood of the actions actually taken: loss = -avg(r * log(p)), where p is the probability the policy assigned to the chosen action and r is the reward credited to that action.
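For concreteness, here is a minimal PyTorch-style sketch of that loss. The shapes, the move/square counts, and the `returns` values are just illustrative placeholders, not my actual setup:

```python
import torch

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss: the reward-weighted negative log-likelihood
    of the actions that were actually taken."""
    return -(rewards * log_probs).mean()

# Hypothetical setup: `logits` would come from the policy network,
# `actions` are the moves actually played, `returns` their discounted rewards.
logits = torch.randn(5, 9, requires_grad=True)        # 5 moves, 9 legal squares (illustrative)
actions = torch.tensor([0, 4, 8, 2, 6])               # indices of the chosen moves
returns = torch.tensor([0.66, 0.73, 0.81, 0.9, 1.0])  # discounted rewards for a win, gamma = 0.9

log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(5), actions]
loss = policy_gradient_loss(log_probs, returns)
loss.backward()  # gradients flow back into the policy parameters
```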
My question now is: how should I define the reward r? Since the game has three possible outcomes: win, loss, or draw, it seems that rewarding 1 for a win, 0 for a draw, and -1 for a loss (and some discounted value of those for the actions leading up to those outcomes) would be a natural choice.
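To spell out what I mean by "discounted value": assuming a discount factor gamma and a single terminal reward for the game's outcome, each move would be credited as sketched below (the helper name and gamma value are just for illustration):

```python
def discounted_returns(outcome_reward, num_moves, gamma=0.9):
    """Credit every move with the terminal reward, discounted by how far
    it was from the end of the game: the final move gets the full reward,
    the move before it gamma * reward, and so on."""
    return [outcome_reward * gamma ** (num_moves - 1 - t) for t in range(num_moves)]

print(discounted_returns(1, 5))   # win:  [0.6561, 0.729, 0.81, 0.9, 1.0]
print(discounted_returns(-1, 5))  # loss: [-0.6561, -0.729, -0.81, -0.9, -1.0]
print(discounted_returns(0, 5))   # draw: [0.0, 0.0, 0.0, 0.0, 0.0]
```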
However, I have some mathematical doubts about this choice:
Win Reward: 1 - This seems to make sense. It should push probabilities towards 1 for moves involved in wins, with a diminishing gradient the closer the probability gets to 1.
Draw Reward: 0 - This does not seem to make sense. A reward of 0 zeroes out the corresponding terms in the loss, so the gradient for moves from drawn games is always 0 and no learning is possible from them.
Loss Reward: -1 - This should kind of work: it should push probabilities towards 0 for moves involved in losses. However, I'm concerned about the asymmetry of the gradient compared to the win case: the closer the probability gets to 0, the steeper the gradient becomes (the derivative of -r * log(p) with respect to p is -r/p, so with r = -1 the slope grows like 1/p as p approaches 0). I worry this creates an extremely strong bias towards a policy that avoids losses, to the degree that the win signal barely matters; see the small numeric check after this list.
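To make that asymmetry concrete, here's a small numeric check of the slope of the per-sample term -r * log(p), viewed purely as a function of p (this ignores the softmax and isolates exactly the effect I'm worried about):

```python
def slope(r, p):
    """d/dp of the per-sample loss term -r * log(p), i.e. -r / p."""
    return -r / p

# Winning moves are pushed towards p = 1, where the slope flattens out:
for p in (0.5, 0.9, 0.99):
    print(f"win  (r=+1), p={p}: slope = {slope(+1, p):+.2f}")   # -2.00, -1.11, -1.01

# Losing moves are pushed towards p = 0, where the slope blows up:
for p in (0.5, 0.1, 0.01):
    print(f"loss (r=-1), p={p}: slope = {slope(-1, p):+.2f}")   # +2.00, +10.00, +100.00
```

So the win gradient stays bounded near its target of p = 1, while the loss gradient grows without bound as p is pushed towards its target of 0, which is the imbalance I'm asking about.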