
I am trying to build a temporal difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard objective. It is basically a mean squared error weighted by the on-policy distribution: sum over all states s of ( onPolicyDistribution(s) * [V(s) - V'(s,w)]² ).
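For reference, this is (if I remember correctly) equation 9.1 in the second edition of Sutton & Barto, where μ is the on-policy state distribution, v_π the true value function and v̂ the parametric approximation:

$$\overline{VE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} \mu(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2$$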

My question is now: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what is the issue if I just use an MSELoss instead?

I'm implementing all of this in PyTorch, so bonus points for an easy implementation there :)

masus04

1 Answer


As you mentioned, in your case it sounds like you are doing Q-learning (TD learning with an ε-greedy policy derived from a learned value function), so you do not need to do policy gradient as described in Sutton's book. That is needed when you are learning a policy directly. You are not learning a policy; you are learning a value function and using it to act.
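As a minimal sketch of what such a value update could look like in PyTorch, assuming a small feed-forward value network and TD(0) targets generated from self-play (the `ValueNet` architecture, the flat 64-dim board encoding, and the hyperparameters below are illustrative placeholders, not your actual setup):

```python
import torch
import torch.nn as nn

# Hypothetical value network for board positions; replace with your own model.
class ValueNet(nn.Module):
    def __init__(self, board_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(board_size, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Tanh(),  # values in [-1, 1] for loss/win in Othello
        )

    def forward(self, x):
        return self.net(x)

value_net = ValueNet()
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
gamma = 1.0  # episodic game, so no discounting is a common choice

def td_update(state, next_state, reward, done):
    """One TD(0) step: regress V(s) towards r + gamma * V(s') with plain MSE."""
    v_s = value_net(state)
    with torch.no_grad():  # the bootstrap target is treated as a constant
        v_next = torch.zeros_like(v_s) if done else value_net(next_state)
        target = reward + gamma * v_next
    loss = loss_fn(v_s, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with dummy tensors (board encoded as a flat 64-dim vector):
s, s_next = torch.rand(1, 64), torch.rand(1, 64)
td_update(s, s_next, reward=0.0, done=False)
```

Regarding the on-policy distribution: because these updates are only performed on states actually visited while following your ε-greedy policy, the per-sample squared errors are already implicitly weighted by how often states occur under that policy, which is the role μ(s) plays in the MSVE definition. So a plain MSELoss on TD targets is the usual choice in practice.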

patapouf_ai