I am looking at tf-agents to learn about reinforcement learning, and I am following this tutorial. The tutorial uses a different policy for training, called `collect_policy`, than for evaluation (`policy`).
The tutorial states that there is a difference, but IMO it does not explain *why* there are two policies, because it does not describe a functional difference between them:
> Agents contain two policies:
>
> - `agent.policy` — The main policy that is used for evaluation and deployment.
> - `agent.collect_policy` — A second policy that is used for data collection.
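In the tutorial, the split looks roughly like this (a minimal sketch; `agent`, `train_env`, and `eval_env` are assumed to be the agent and the TF environments built earlier in the tutorial):

```python
# Experience for training is gathered with the collect policy:
time_step = train_env.reset()
action_step = agent.collect_policy.action(time_step)  # PolicyStep(action, state, info)
time_step = train_env.step(action_step.action)

# Evaluation / deployment uses the main policy:
time_step = eval_env.reset()
action_step = agent.policy.action(time_step)
time_step = eval_env.step(action_step.action)
```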
I've looked at the source code of the agent. It says:

> policy: An instance of `tf_policy.Base` representing the Agent's current policy.
> collect_policy: An instance of `tf_policy.Base` representing the Agent's current data collection policy (used to set `self.step_spec`).
But I do not see `self.step_spec` anywhere in the source file. The closest thing I can find is `time_step_spec`, but that is the first constructor argument of the `TFAgent` class, so it makes no sense to set it via a `collect_policy`.
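For what it's worth, inspecting the agent at runtime only confirms that these are two separate policy objects, not what the functional difference is (a quick check, assuming an `agent` built as in the tutorial):

```python
print(type(agent.policy))          # the evaluation policy object
print(type(agent.collect_policy))  # the data-collection policy object
print(agent.policy is agent.collect_policy)  # False: two distinct objects
```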
So the only thing I could think of was to put it to the test: I used `policy` instead of `collect_policy` for training, and the agent nonetheless reached the maximum score in the environment.
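Concretely, the collection step I changed looks roughly like this (a sketch; `train_env`, `agent`, and `replay_buffer` are the objects defined in the tutorial):

```python
from tf_agents.trajectories import trajectory

def collect_step(environment, policy, buffer):
    # One interaction step: act, observe, and store the transition.
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    buffer.add_batch(traj)

# Tutorial version: collect_step(train_env, agent.collect_policy, replay_buffer)
collect_step(train_env, agent.policy, replay_buffer)  # what I tried instead
```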
So what is the functional difference between the two policies?