
I am trying to plot the overestimation bias of the critics in DDPG and TD3 models. Essentially there is a critic_target network and a critic network. I want to understand how one goes about measuring the overestimation bias of the critic relative to the true Q value, and also how to find the true Q value in the first place.

I see in the original TD3 paper (https://arxiv.org/pdf/1802.09477.pdf) that the authors measure the overestimation bias of the value networks. Can someone guide me on how to plot the same quantity during the training phase of my actor-critic model?


1 Answer


Answering my own question: essentially, during the training phase, at each evaluation period (for example, every 5000 steps) we can call a function that does the following. Keep in mind that the policy is kept fixed throughout this evaluation rollout.

The pseudocode is as follows:

import gym

def get_estimation_values(policy, env_name, gamma=0.99):
    eval_env = gym.make(env_name)
    state, done = eval_env.reset(), False   # old (pre-0.26) gym API
    episode_reward = 0
    max_steps = eval_env._max_episode_steps  # episode length from the TimeLimit wrapper

    # For example, if there is only one critic, like in DDPG
    action = policy.actor(state)
    estimated_Q = policy.critic(state, action)  # estimated Q value for the starting state s0
    # (convert state/action between numpy arrays and tensors here if your networks require it)

    # The true Q value is given by:
    # Q(s0, a0) = r_0 + gamma * Q(s1, a1)
    # Q(s1, a1) = r_1 + gamma * Q(s2, a2)
    # Q(s2, a2) = r_2 + gamma * Q(s3, a3), and so on.
    #
    # Unrolling the recursion, the true Q value is simply the discounted return:
    # true_Q = r_0 + gamma*r_1 + (gamma^2 * r_2) + (gamma^3 * r_3) + ... until the terminal state

    # Code to find the true (Monte Carlo) Q value
    true_Q = 0

    for t in range(max_steps):
        if done:
            break

        # Take actions according to the current (fixed) policy until done
        action = policy.actor(state)  # maybe convert tensor to numpy if required
        next_state, reward, done, _ = eval_env.step(action)
        episode_reward += reward

        true_Q = true_Q + (gamma ** t) * reward
        state = next_state  # important: advance the state, otherwise the rollout never progresses

    return estimated_Q, true_Q
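
To actually produce the plot, a minimal logging sketch is below. It assumes a training script where policy and env_name already exist; eval_freq, total_steps, and the *_log lists are just illustrative names, and the gap between the two plotted curves is the overestimation bias.

import matplotlib.pyplot as plt

eval_freq = 5000          # evaluate every 5000 environment steps, as in the answer
total_steps = 1_000_000   # illustrative training budget
steps_log, estimated_log, true_log = [], [], []

for step in range(1, total_steps + 1):
    # ... one DDPG/TD3 training update goes here ...

    if step % eval_freq == 0:
        est_Q, true_Q = get_estimation_values(policy, env_name, gamma=0.99)
        steps_log.append(step)
        estimated_log.append(float(est_Q))
        true_log.append(true_Q)

# Plot the estimated value of (s0, a0) against the Monte Carlo return from s0;
# the distance between the two curves is the overestimation bias.
plt.plot(steps_log, estimated_log, label="estimated Q(s0, a0)")
plt.plot(steps_log, true_log, label="true (discounted) return")
plt.xlabel("environment steps")
plt.ylabel("value")
plt.legend()
plt.show()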
  • for an actionable improvement over the method listed above, you should sample many rollouts and compare the summary statistics like the mean, median, or interquartile mean. I think you will find that the biggest difference is that the variance of `estimated_Q` can be very high – physincubus Jun 22 '23 at 01:42
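
Following up on that comment, a minimal sketch of averaging over several evaluation rollouts, reusing the get_estimation_values function from the answer above (the helper name get_bias_statistics and the num_rollouts parameter are just illustrative), might look like this:

import numpy as np

def get_bias_statistics(policy, env_name, gamma=0.99, num_rollouts=10):
    # Run several independent evaluation episodes with the (fixed) current policy
    estimated, true = [], []
    for _ in range(num_rollouts):
        est_Q, true_Q = get_estimation_values(policy, env_name, gamma)
        estimated.append(float(est_Q))
        true.append(true_Q)

    bias = np.array(estimated) - np.array(true)  # per-rollout overestimation bias
    return {
        "mean_bias": float(bias.mean()),
        "median_bias": float(np.median(bias)),
        "std_estimated_Q": float(np.std(estimated)),  # this spread can be large, as the comment notes
    }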