
I have a policy that I read from disk using SavedModelPyTFEagerPolicy. To troubleshoot my environment definitions, I would like to examine the predicted values of different states.

I have had success using these instructions to extract the actions from the policy for test cases. Is there a function that will allow me to extract the predicted values associated with those actions?
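
For reference, here is roughly how I load the policy and pull out actions for a test case (the path and environment below are placeholders, not my actual setup):

```python
from tf_agents.environments import suite_gym
from tf_agents.policies import py_tf_eager_policy

# Placeholder environment and model path for illustration only.
eval_py_env = suite_gym.load('CartPole-v0')
saved_policy = py_tf_eager_policy.SavedModelPyTFEagerPolicy(
    '/path/to/saved_policy', load_specs_from_pbtxt=True)

# Reset the environment and ask the policy for an action.
time_step = eval_py_env.reset()
action_step = saved_policy.action(time_step)
print(action_step.action)  # the chosen action, but not its predicted value
```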

Setjmp

1 Answer


Looking at the TensorFlow DQN Agent documentation, you hand a Q-network to the agent at creation time. It gets saved as an instance variable named _q_network and can be accessed via agent._q_network. To quote the documentation:

The network will be called with call(observation, step_type) and should emit logits over the action space.

Those logits are the respective state-action values.
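
A rough sketch of what that access could look like (assuming `agent` is your existing DqnAgent and `time_step` comes from a batched TF environment; adjust the names to your setup):

```python
import tensorflow as tf

# time_step.observation is batched, e.g. shape [batch_size, obs_dim].
q_values, _ = agent._q_network(time_step.observation,
                               step_type=time_step.step_type,
                               training=False)

# q_values has shape [batch_size, num_actions]; entry [i, a] is the
# predicted value of taking action a given observation i.
greedy_action = tf.argmax(q_values, axis=-1)
predicted_value = tf.reduce_max(q_values, axis=-1)
```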

tnfru
  • If we look up the definition of the Q-function in a textbook, it should be a prediction of reward given a state and action. However, the documentation you refer to suggests call() emits logits. When I inspect their values, they don't look like rewards. When we create the network in the Deep Q tutorial, the final layer of the network has no activation function, and I assume the reason is that we are modelling the value as a regression-style problem. So how do I access the value prediction conditioned on state and action? – Setjmp Aug 28 '21 at 16:29
  • The Q-function does not predict rewards, but **expected returns** most of the time. The logits emitted are these expected returns for actions. One does not use an activation function because it is a regression. – tnfru Aug 29 '21 at 09:08
  • It also might be predicting advantages if the reason you're confused is that some are negative. Advantage is A(s,a) = Q(s,a) - V(s). – tnfru Aug 30 '21 at 11:48
  • Unfortunately I can't correct reward -> expected returns in my comment. In theory your overall approach to accessing the values seems valid. In practice, I found that if I access the q_net values in the middle of training (as I would like), my network fails to converge. On the other hand, if I save the policy and load it from disk, I don't see the q_network attribute in the restored policy object. – Setjmp Aug 31 '21 at 05:06
  • Call your q_net without gradient so as not to mess with the training process (see the sketch below these comments). I'm not too sure about loading the policy afterwards, but shouldn't it restore to the same object, and hence the same instance variables? – tnfru Aug 31 '21 at 17:05
  • Update... I had a few details wrong in my code, leading to some confusion. However, your advice on accessing the network worked once I ironed out those details. Despite what the current version of the docs say, the output of the model looks like the value predictions and not some logit. Thank you for your help, tnfru. – Setjmp Sep 06 '21 at 00:48
  • @Setjmp I found out that the word logit is overloaded and has different meanings in math and ML. You probably want to check this out; it will clarify the terminology: https://stackoverflow.com/a/43577384/8098068 – tnfru Sep 16 '21 at 10:33
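
A minimal sketch of the gradient-free inspection suggested in the comments (`agent` and the observation batch are assumptions for illustration):

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch of observations to inspect mid-training.
sample_observations = tf.constant(np.zeros((4, 4), dtype=np.float32))

# Call the network in inference mode, outside any GradientTape,
# so the inspection does not interfere with training updates.
q_values, _ = agent._q_network(sample_observations, training=False)
print(q_values.numpy())  # predicted state-action values per observation
```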