
I am implementing the Q-learning algorithm and I have observed that my Q-values are not converging to the optimal Q-values, even though the policy seems to be converging. I defined the action selection strategy as epsilon-greedy, with epsilon decreasing by 1/N starting from 1 (N being the total number of iterations). That way, in the earlier iterations the algorithm explores randomly, and this rate gradually decreases, leading to exploitation. In addition, I defined the learning rate as 1/N_t(s,a), where N_t(s,a) is the total number of times (s,a) has been visited.
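For concreteness, here is a minimal tabular sketch of this setup (not my actual code; the environment interface `env.reset()` / `env.step(a)` is a hypothetical gym-style placeholder, and states/actions are assumed to be integer ids):

```python
import numpy as np

# Minimal tabular Q-learning sketch of the setup described above.
# The environment interface (env.reset(), env.step(a)) is a hypothetical
# gym-style placeholder; states and actions are assumed to be integer ids.
def q_learning(env, n_states, n_actions, N, gamma=0.99):
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))     # N_t(s, a)

    for t in range(N):
        epsilon = 1.0 - t / N                    # decays from 1 towards 0
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, done = env.step(a)

            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]           # learning rate 1 / N_t(s, a)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```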

Everything seems to be correct, but since I can't get to the optimal Q-values, I started looking into different strategies and in the meantime got super confused. I know that convergence is achieved when all (s,a) pairs are visited infinitely often. Isn't this equivalent to saying that all (s,a) pairs are explored many times? In other words, why do we need exploitation for convergence? What if we don't exploit and just focus on exploring? If we do that, we search the whole solution space, so shouldn't that be enough to find an optimal policy?

Also, when it is said that the Q-values converge to optimal, does only max_a[Q(s,a)] converge to its optimal value, or do all Q(s,a) values converge to their optimal values?

Probably there is a simple answer to all of these, but even though I checked a lot of resources and similar threads, I still couldn't figure out the logic behind exploitation. Thanks a lot in advance for your time!

Aybike

2 Answers


Exploitation indeed isn't necessary for convergence in theory. In practice, it often turns out to be important, or even necessary, for one or both of the following two reasons:

  1. Sometimes we are not learning just for the sake of learning, but we also care about our performance already during the learning/training process. This means we need a balance between exploitation (performing well) and exploration (continuing to learn).

  2. More importantly, if we purely explore and do not exploit at all, this may also limit our ability to learn in practice, because there are many states that we may simply fail to reach if we always act randomly.

To clarify on the second point, consider, for example, that we're in one corner of a large, 2D grid, and our goal position is in the opposite corner. Suppose that we already get small rewards whenever we move closer to the goal, and small negative rewards whenever we move further away. If we have a balance between exploration and exploitation, it is likely that we'll quickly learn to walk along the path from start to goal, but also bounce around that path a bit randomly due to exploration. In other words, we'll start learning what to do in all states around that path.

Now, suppose you try learning in the same situation only by acting randomly (i.e. no exploitation). If we only act randomly in a sufficiently large 2D grid, and we always start in one corner, it's highly unlikely that we'll ever manage to reach the other side of the grid. We'll just keep moving randomly in an area around the starting position, and never learn what to do in states far away from it. Obviously, we would reach every state given an infinite amount of time, but we rarely have an infinite amount of time in practice.
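To make this concrete, here is a small simulation sketch (the grid size, step budget, and number of episodes are arbitrary assumptions) that estimates how often a purely random walk starting in one corner ever reaches the opposite corner:

```python
import numpy as np

# Rough estimate of how often a purely random walk starting at (0, 0)
# reaches the opposite corner of an n x n grid within a fixed step budget.
# Grid size, step limit, and episode count are arbitrary illustrative choices.
def random_walk_success_rate(n=50, max_steps=2000, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    successes = 0
    for _ in range(episodes):
        x = y = 0
        for _ in range(max_steps):
            dx, dy = moves[rng.integers(4)]
            x = min(max(x + dx, 0), n - 1)   # stay inside the grid
            y = min(max(y + dy, 0), n - 1)
            if (x, y) == (n - 1, n - 1):
                successes += 1
                break
    return successes / episodes

print(random_walk_success_rate())  # tends towards 0 as the grid grows
```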

Dennis Soemers
  • This does not seem correct to me. Some methods, such as SARSA, require using the learned policy, i.e. exploiting, in order to converge. See my answer. – Pablo EM Mar 29 '18 at 10:26
  • @PabloEM On-policy methods (like SARSA) do indeed learn values for the policy that they're following, but theoretically there's no need for such algorithms to follow a policy that involves exploitation. In theory, you could have such an algorithm follow (and learn values for) a completely random policy. In practice, that would often be a bad idea for the reasons described in my answer. – Dennis Soemers Mar 29 '18 at 10:31
  • The question is about whether exploitation is required for convergence. As you said, SARSA learns values for the policy it is following. So, in theory, for convergence in SARSA you must exploit the learned policy. If you have the SARSA algorithm follow a completely random policy, convergence is never achieved. That is why I think your answer is incorrect. – Pablo EM Mar 29 '18 at 10:43
  • @PabloEM I don't agree with saying that convergence is never achieved if you have SARSA following a completely random policy. It does converge. It converges to the state-action values that are correct given the knowledge that your agent will continue acting randomly afterwards. So, for every `(S, A)`, it will converge to the expected one-step reward `R` for executing `A` in `S`, plus `gamma` times the expected returns for acting randomly after executing `A` in `S`. I agree that those are not very useful quantities to converge to, but it does converge to something. – Dennis Soemers Mar 29 '18 at 10:52
  • Apart from that, the question seems to be primarily about Q-learning anyway (though it's good to cover the more general case as well). – Dennis Soemers Mar 29 '18 at 10:54
  • Yes, as you said, "It converges to the state-action values that are correct given the knowledge that your agent will continue acting randomly afterwards", so it doesn't converge to the optimal policy. When we talk about convergence, we usually mean convergence to the optimal policy. I'm pretty sure the question asker is also talking about convergence to the optimal values/policy, not convergence to something else. – Pablo EM Mar 29 '18 at 11:09
  • That depends on how you define "optimal values". If you define them in terms of the greedy policy, then yeah, SARSA with a random policy doesn't converge to them. SARSA with many other kinds of policies (e.g. any `epsilon`-greedy policy with constant `epsilon`) doesn't either. If you define them in terms of the policy being followed, then yes, they all do converge to optimal values for their policy (`epsilon`-greedy with `epsilon = 1.0` in the case of no exploitation). I agree the question doesn't seem to be about this or any other variant of SARSA though, it's about Q-learning. – Dennis Soemers Mar 29 '18 at 11:35
  • A standard, common, and well-accepted definition of optimal policy: "There is always at least one policy that is better than or equal to all other policies. This is an optimal policy." I'm pretty sure you understand what an optimal policy is. My point is that some RL algorithms require exploitation to achieve the optimal policy, which seems relevant to the original question. And, according to your comments, you seem to agree. So I'm not going to start a discussion about what "optimal policy" means, because it's totally off-topic (and because I don't have time! ;-) ). – Pablo EM Mar 29 '18 at 11:57
  • @PabloEM Those RL algorithms (like SARSA) actually are not guaranteed to converge to that optimal policy at all. If you always keep some amount of exploration, they will instead converge to a different notion of optimality (a policy that "knows of itself" that it will sometimes explore instead). If you eventually reduce exploration to zero, you are not guaranteed an infinite amount of exploration, so you're not guaranteed convergence to anything at all. So in this case I still disagree, SARSA won't converge to that definition of optimality at all, even with exploitation. – Dennis Soemers Mar 29 '18 at 12:03
  • Sure, probably Rich Sutton & Andrew Barto are wrong in their RL textbook when they say: "SARSA converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times [...]" (http://incompleteideas.net/book/ebook/node64.html). They will be very grateful to receive corrections for the 2nd edition of the book. So, please tell me/them why the previous statement is wrong and how it should be improved. I'll be very happy to see your contribution to that wonderful RL book. – Pablo EM Mar 29 '18 at 16:22
  • You forgot about the "and the policy converges in the limit to the greedy policy" at the end of that sentence. SARSA only converges to your notion of optimality if you have both 1) infinite exploration and 2) convergence to the greedy policy in the limit. I explicitly took those conditions into account when writing my comments above; if you read them carefully, you'll see they don't contradict any of this. In practice, it is close to impossible to satisfy both of those conditions. On top of that, if you assume you do converge to the greedy policy, my original answer is completely correct again. – Dennis Soemers Mar 29 '18 at 16:54
  • No, it is not close to impossible to satisfy the condition of "infinite exploration"; it is totally impossible to implement in practice an experiment with infinite exploration. By definition, you will never reach infinity in practice. However, it's a theoretical condition for convergence in many algorithms. So, saying that SARSA is not guaranteed to converge because its convergence proofs are based on asymptotic conditions makes no sense. In fact, Q-learning also requires infinite exploration. BTW, it's not "my notion of optimality", it's the standard definition of optimality in this context. – Pablo EM Mar 29 '18 at 17:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/167828/discussion-between-dennis-soemers-and-pablo-em). – Dennis Soemers Mar 29 '18 at 17:25

As you already said, from a theoretical point of view, RL methods always require that all (s,a) pairs are visited infinitely often. However, an exploitation stage is only necessary depending on the type of RL algorithm. A key concept relevant to your question is the distinction between on-policy and off-policy algorithms.

In on-policy algorithms (e.g. SARSA), the agent must interact with the environment using the same policy that is being learned. So, this kind of method requires using the learned policy (i.e. exploitation) in order to achieve convergence.

On the contrary, in off-policy algorithms (e.g. Q-learning), the agent can follow any (sufficiently exploratory) behaviour policy while still converging to the optimal policy.
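The difference shows up directly in the update target. Here is a minimal sketch of the two one-step updates (variable names are just for illustration; `Q` is assumed to be a 2D array indexed by state and action):

```python
import numpy as np

# SARSA (on-policy): the update target uses the action actually taken by
# the behaviour policy in the next state, so the learned values depend on
# how the agent behaves (and hence on whether it exploits).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# Q-learning (off-policy): the update target uses the greedy action in the
# next state, regardless of which action the behaviour policy takes.
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```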

Off-policy methods can be very useful in problems where the agent-environment interaction data has been collected in advance. For example, in a medical problem where you have stored interactions between physician treatments and patient responses, you could apply an off-policy algorithm to learn the optimal treatment. In this case there is obviously no exploitation, because the agent is not interacting with the environment after learning starts.
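As a sketch of that offline setting (the `(s, a, r, s_next, done)` tuple format and the fixed learning rate are assumptions for illustration), the off-policy update above can simply be replayed over the logged transitions:

```python
import numpy as np

# Offline (batch) Q-learning from pre-collected transitions.
# `dataset` is assumed to be a list of (s, a, r, s_next, done) tuples
# logged from past interactions, e.g. historical treatment records.
def fit_q_from_logs(dataset, n_states, n_actions, sweeps=100,
                    alpha=0.1, gamma=0.99):
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
    return Q
```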

However, notice that off-policy methods can also be employed with exploitation, although it should be clear that this is not a requirement. In most typical RL problems, the goal is for the agent to choose good actions as soon as possible. In such a case, it makes sense to balance exploration and exploitation right from the start of learning, independently of whether the algorithm is on-policy or off-policy.

Pablo EM
  • Thanks a lot for the explanation! I have a follow-up question. Can we say that after convergence is achieved, every Q-value (for all the state-action pairs) converges to its optimal value? So, for instance, suppose the optimal action for state 0 is 1. Can we still say that Q(0,2) is the best value that can be achieved if you take action 2 from state 0? Thank you so much again! – Aybike Mar 30 '18 at 23:02
  • You're welcome! Sorry, I forgot to answer that part of your question. Yes, when all state-action pairs have been visited enough times to achieve convergence, all Q-values converge to their optimal values, not only max_a[Q(s,a)]. So, in your example, you can indeed say that Q(0,2) is the best value achievable by taking action 2 in state 0. – Pablo EM Mar 31 '18 at 09:02