I'm using the DQN algorithm to train an agent in my environment, which looks like this:
- Agent is controlling a car by picking discrete actions (left, right, up, down)
- The goal is to drive at a desired speed without crashing into other cars
- The state contains the velocities and positions of the agent's car and the surrounding cars
- Rewards: -100 for crashing into other cars; a positive reward that grows as the speed gets closer to the desired speed, up to +50 when driving exactly at the desired speed (see the simplified sketch after this list)
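For reference, here is a minimal sketch of the reward logic. The linear fall-off and the `speed_tolerance` parameter are illustrative assumptions, not my exact implementation:

```python
CRASH_PENALTY = -100.0
MAX_SPEED_REWARD = 50.0

def compute_reward(crashed: bool, speed: float, desired_speed: float,
                   speed_tolerance: float = 10.0) -> float:
    """Simplified reward shaping.

    speed_tolerance is illustrative: it controls how quickly the reward
    falls off as the speed deviates from the desired speed.
    """
    if crashed:
        return CRASH_PENALTY
    # Linear fall-off: +50 at the desired speed, decreasing with |speed - desired_speed|.
    deviation = abs(speed - desired_speed)
    return max(0.0, MAX_SPEED_REWARD * (1.0 - deviation / speed_tolerance))
```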
I have already tuned some hyperparameters (network architecture, exploration, learning rate), which gave me decent results, but still not as good as they should/could be. The reward per episode is increasing during training, and the Q-values are converging as well (see figure 1). However, for all hyperparameter settings I tried, the Q-loss does not converge (see figure 2). I assume that this lack of convergence of the Q-loss might be the limiting factor for better results.
Figure 1: Q-value of one discrete action during training
I'm using a target network that is updated every 20k timesteps. The Q-loss is calculated as the MSE between the online network's Q-values and the bootstrapped targets.
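Concretely, the loss I compute looks roughly like the following PyTorch-style sketch (the network and replay-batch layout are assumptions for illustration, not my exact code):

```python
import torch
import torch.nn.functional as F

def dqn_loss(policy_net, target_net, batch, gamma=0.99):
    """MSE Q-loss with a frozen target network (simplified sketch)."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the online network for the actions actually taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the frozen target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)

# Every 20k environment steps the target network is synced:
# target_net.load_state_dict(policy_net.state_dict())
```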
Do you have any ideas why the Q-loss is not converging? Does the Q-loss have to converge for the DQN algorithm to work well? I'm wondering why the Q-loss is not discussed in most papers.