How does it differ from the regular network? Source text --> "In the DDPG algorithm, the topology consists of two copies of the network weights for each network: (Actor: regular and target) and (Critic: regular and target)"

1 Answer
Sorry, but I'm afraid you have to look a bit at the math of the DDPG algorithm here to understand why it is called a "target network". DDPG minimizes the following loss (from the original paper https://arxiv.org/pdf/1509.02971.pdf):
L(theta) = E[ (Q(s_t, a_t | theta) - y_t)^2 ],   with   y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta)
where Q is represented by your neural network, a.k.a. your "agent", and y is the so-called target. It is called the target because you want the values of your agent to be close to it. Just for clarification: Q(s_t, a_t | theta) corresponds to the output of your agent at time step t, given state s_t, action a_t and network weights theta.
However, as you can see, the target y depends on the same (neural network) parameters theta of your agent. In practice, this dependency leads to instabilities when minimizing the above loss.
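To make that concrete, here is a minimal PyTorch-style sketch of this loss computed without any target network; the module sizes and the names (actor, critic, gamma, the fake batch) are illustrative assumptions, not taken from the paper or the post:

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99

# Toy actor mu(s) and critic Q(s, a | theta), standing in for the real networks.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))

# A fake batch of transitions (s_t, a_t, r_t, s_{t+1}).
s, a = torch.randn(8, state_dim), torch.randn(8, action_dim)
r, s_next = torch.randn(8, 1), torch.randn(8, state_dim)

# Q(s_t, a_t | theta): the output of the "agent" (regular critic).
q = critic(torch.cat([s, a], dim=1))

# Target y_t = r_t + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta).
# Without a target network this reuses the SAME weights theta, so y shifts
# every time the critic is updated -- the moving-target problem the answer describes.
with torch.no_grad():
    y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))

loss = ((q - y) ** 2).mean()
loss.backward()
```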
One trick to mitigate this problem is to use a second, "target" network, which is either
- a frozen copy of the agent ("regular") network, which is simply overwritten with the regular network's weights every fixed number of steps (e.g. every 10,000 iterations). This is the approach taken in DQN.
- or a lagged version of the actual agent ("regular") network, where the lagging is achieved via so-called Polyak averaging. That is, instead of updating the weights of your target network by just copying over the regular network's weights, at each iteration you take a weighted average, theta_target <- tau * theta_regular + (1 - tau) * theta_target, with a small tau (the paper uses tau = 0.001). This is the approach taken in DDPG.
So simply put, the target network is nothing other than a lagged (or periodically frozen) copy of the regular network.
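For concreteness, here is a minimal PyTorch-style sketch of both update schemes; the network size and the names (critic, target_critic, hard_update, soft_update) are illustrative assumptions, only tau = 0.001 and the 10,000-step copy interval come from the papers:

```python
import copy
import torch
import torch.nn as nn

# A toy critic standing in for Q(s, a | theta).
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

# The target network starts out as an exact copy of the regular network.
target_critic = copy.deepcopy(critic)

def hard_update(target: nn.Module, source: nn.Module) -> None:
    """DQN-style update: overwrite the target with the regular weights
    every fixed number of steps (e.g. every 10,000 iterations)."""
    target.load_state_dict(source.state_dict())

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.001) -> None:
    """DDPG-style Polyak averaging: the target lags behind the regular
    network, moving a small fraction tau towards it at every iteration."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)

# Usage: after each gradient step on the regular critic, nudge the target.
soft_update(target_critic, critic, tau=0.001)
```

The soft update keeps the target's values changing slowly and smoothly, which is exactly the "lagged version of the regular network" described above.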
