I am looking at tf-agents to learn about reinforcement learning, and I am following this tutorial. The tutorial uses a different policy for training, called `collect_policy`, than for evaluation (`policy`).
The tutorial states that there is a difference, but IMO it does not explain *why* there are two policies, because it does not describe a functional difference between them:
> Agents contain two policies:
>
> - `agent.policy` — The main policy that is used for evaluation and deployment.
> - `agent.collect_policy` — A second policy that is used for data collection.
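In the tutorial, the split looks roughly like this (a minimal sketch; `agent`, `train_env`, and `eval_env` are assumed to be the agent and the TF environments built earlier in the tutorial):

```python
# Experience for training is gathered with the collect policy:
time_step = train_env.reset()
action_step = agent.collect_policy.action(time_step)  # PolicyStep(action, state, info)
time_step = train_env.step(action_step.action)

# Evaluation / deployment uses the main policy:
time_step = eval_env.reset()
action_step = agent.policy.action(time_step)
time_step = eval_env.step(action_step.action)
```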
I've looked at the source code of the agent. It says:

> policy: An instance of `tf_policy.Base` representing the Agent's current policy.
> collect_policy: An instance of `tf_policy.Base` representing the Agent's current data collection policy (used to set `self.step_spec`).
But I do not see `self.step_spec` anywhere in the source file. The closest thing I can find is `time_step_spec`, but that is the first constructor argument of the `TFAgent` class, so it makes no sense to set it via a `collect_policy`.
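For what it's worth, inspecting the agent at runtime only confirms that these are two separate policy objects, not what the functional difference is (a quick check, assuming an `agent` built as in the tutorial):

```python
print(type(agent.policy))          # the evaluation policy object
print(type(agent.collect_policy))  # the data-collection policy object
print(agent.policy is agent.collect_policy)  # False: two distinct objects
```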
So the only thing I could think of was to put it to the test: I used `policy` instead of `collect_policy` for training, and the agent nonetheless reached the maximum score in the environment.
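Concretely, the collection step I changed looks roughly like this (a sketch; `train_env`, `agent`, and `replay_buffer` are the objects defined in the tutorial):

```python
from tf_agents.trajectories import trajectory

def collect_step(environment, policy, buffer):
    # One interaction step: act, observe, and store the transition.
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)
    buffer.add_batch(traj)

# Tutorial version: collect_step(train_env, agent.collect_policy, replay_buffer)
collect_step(train_env, agent.policy, replay_buffer)  # what I tried instead
```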
So what is the functional difference between the two policies?