I am trying to build a temporal difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard objective for value prediction. It is basically a mean squared error weighted by the on-policy distribution: a sum over all states s of onPolicyDistribution(s) * [V(s) - V'(s,w)]².
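Written out in Sutton's notation, as far as I understand it:

$$\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s)\,\big[v_\pi(s) - \hat{v}(s,\mathbf{w})\big]^2$$

where $\mu(s)$ is the on-policy distribution, $v_\pi(s)$ the true value of $s$ under the policy, and $\hat{v}(s,\mathbf{w})$ my network's estimate.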
My question is: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what is the issue if I just use a plain MSELoss instead?
I'm implementing all of this in PyTorch, so bonus points for an easy implementation there :)
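In case it matters, here is a stripped-down sketch of the kind of TD(0) update I am doing right now with a plain MSELoss; the network shape, hyperparameters, and names are just placeholders, not my real code:

```python
import torch
import torch.nn as nn

# Placeholder value network: maps a flattened 8x8 board encoding to V'(s, w).
value_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
gamma = 0.99

def td_update(state, reward, next_state, done):
    """One TD(0) step: regress V'(s, w) towards r + gamma * V'(s', w)."""
    value = value_net(state)
    with torch.no_grad():  # bootstrapped target is treated as a constant
        target = reward + gamma * value_net(next_state) * (1.0 - done)
    loss = loss_fn(value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy tensors standing in for encoded Othello positions.
s = torch.randn(1, 64)
s_next = torch.randn(1, 64)
print(td_update(s, torch.tensor([[0.0]]), s_next, torch.tensor([[0.0]])))
```

The target is detached so gradients only flow through the current state's estimate; the samples come from whatever states the ε-greedy policy happens to visit, which is where my uncertainty about the weighting by μ(s) comes from.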