Sarsa and Q Learning (reinforcement learning) don't converge optimal policy

Question

I have a question about my own project for testing reinforcement learning technique. First let me explain you the purpose. I have an agent which can take 4 actions during 8 steps. At the end of this eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To access of this 5 victories (with different cost value: 50, 50, 0, 40, 60), the agent don't take the same path (like a graph). The blue states are the fail states (sorry for quality) and the episode is stopped.

enter image description here

The real good path is: DCCBBAD

Now my question, I don't understand why in SARSA & Q-Learning (mainly in Q learning), the agent find a path but not the optimal one after 100 000 iterations (always: DACBBAD/DACBBCD). Sometime when I compute again, the agent falls in the good path (DCCBBAD). So I would like to understand why sometime the agent find it and why sometime not. And there is a way to look at in order to stabilize my agent?

Thank you a lot,

Tanguy

Since these algorithms are inherently stochastic, of course there is a chance that it might not arrive at the optimal policy, given any number of iterations. As the number of iterations increases, the probability of this happening becomes smaller and smaller. Since this is exactly what you're experiencing (sometimes optimal, sometimes not), perhaps 100.000 iterations isn't enough to ensure convergence with a reasonably high probability. Does dialing it up to 200k do the trick? Finding that sweet spot is not exact science. — torbonde, Oct 11 '18 at 08:35
Yeah!! I've increased at 200k iterations, change my epsilon greedy equal to 0,5/t (with t += 1e-2), and I forced my first 10000 iterations with a more exploration behavior. Now, every tests converged throughout the optimum. Thank a lot — T.L, Oct 11 '18 at 11:23

score 1 · Answer 1 · answered Oct 11 '18 at 19:06

TD;DR;

Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.

More detailed version:

Q-learning is only garranteed to converge under the following conditions:

You must visit all state and action pairs infinitely ofter.
The sum of all the learning rates for all timesteps must be infinite, so
The sum of the square of all the learning rates for all timesteps must be finite, that is

To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very very slowly and perhaps never all the way to 0. You can try epsilon , too. To hit 2 and 3, you must ensure you take care of 1, so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means =< 1. If your environment is deterministic you should try 1. Deterministic environment here that means when taking an action a in a state s you transition to state s' for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.

Maybe checkout https://youtu.be/wZyJ66_u4TI?t=2790 for more info.

Sarsa and Q Learning (reinforcement learning) don't converge optimal policy

1 Answers1