I am trying to build a temporal difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard objective for value prediction. It is basically a mean squared error weighted by the on-policy distribution: a sum over all states s of onPolicyDistribution(s) * [V(s) - V'(s,w)]².
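Written out in Sutton's notation, as far as I understand it:

$$\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s)\,\big[v_\pi(s) - \hat{v}(s,\mathbf{w})\big]^2$$

where $\mu(s)$ is the on-policy distribution, $v_\pi(s)$ the true value of $s$ under the policy, and $\hat{v}(s,\mathbf{w})$ my network's estimate.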
My question is: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what is the issue if I just use a plain MSELoss instead?
I'm implementing all of this in PyTorch, so bonus points for an easy implementation there :)
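In case it matters, here is a stripped-down sketch of the kind of TD(0) update I am doing right now with a plain MSELoss; the network shape, hyperparameters, and names are just placeholders, not my real code:

```python
import torch
import torch.nn as nn

# Placeholder value network: maps a flattened 8x8 board encoding to V'(s, w).
value_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
gamma = 0.99

def td_update(state, reward, next_state, done):
    """One TD(0) step: regress V'(s, w) towards r + gamma * V'(s', w)."""
    value = value_net(state)
    with torch.no_grad():  # bootstrapped target is treated as a constant
        target = reward + gamma * value_net(next_state) * (1.0 - done)
    loss = loss_fn(value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy tensors standing in for encoded Othello positions.
s = torch.randn(1, 64)
s_next = torch.randn(1, 64)
print(td_update(s, torch.tensor([[0.0]]), s_next, torch.tensor([[0.0]])))
```

The target is detached so gradients only flow through the current state's estimate; the samples come from whatever states the ε-greedy policy happens to visit, which is where my uncertainty about the weighting by μ(s) comes from.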