11

The difference between Q-learning and SARSA is that Q-learning compares the current state and the best possible next state, whereas SARSA compares the current state against the actual next state.

If a greedy selection policy is used, that is, the action with the highest action value is selected 100% of the time, are SARSA and Q-learning then identical?
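For reference, these are the standard one-step updates (notation as in Sutton and Barto), so the two targets can be compared directly:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \qquad \text{(Q-learning)}$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right] \qquad \text{(SARSA, with } a' \text{ the action actually taken in } s'\text{)}$$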

nbro
Mouscellaneous
  • To get a better intuition on the similarities between SARSA and Q-Learning, I would suggest looking into Expected-SARSA. It can be shown that Expected-SARSA is equivalent to Q-Learning when using a greedy selection policy. – Andnp Jun 15 '16 at 17:11
  • I’m voting to close this question because it's not a programming issue, but a conceptual question. This type of question should be asked on https://ai.stackexchange.com/. – nbro Apr 03 '22 at 09:51

3 Answers

17

Well, not exactly. A key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm (it follows the policy that it is learning), whereas Q-learning is an off-policy algorithm (it can follow any behaviour policy that fulfills some convergence requirements).

Notice in the following pseudocode of both algorithms that SARSA observes s' and chooses a' before updating the Q-function, while Q-learning updates the Q-function first; the next action to perform is then selected in the next iteration, derived from the updated Q-function and not necessarily equal to the (maximizing) a' used in the update.

[Pseudocode of SARSA (on-policy TD control), from Sutton and Barto]

[Pseudocode of Q-learning (off-policy TD control), from Sutton and Barto]

In any case, both algorithms require exploration (i.e., taking actions different from the greedy action) to converge.

The pseudocode of SARSA and Q-learning has been extracted from Sutton and Barto's book, Reinforcement Learning: An Introduction (HTML version).
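To make that ordering difference concrete, here is a minimal Python sketch of one episode of each algorithm. It is only an illustration of the book's pseudocode, not a reference implementation: the `env` object (with `reset()` and a `step()` returning `(next_state, reward, done)`), the tabular `Q[state][action]` structure (e.g., a dict of dicts defaulting to 0, including terminal states), and the `epsilon_greedy` helper are all assumptions of the sketch.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)                 # choose a in s
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)   # choose a' BEFORE updating Q
        target = r + (0.0 if done else gamma * Q[s_next][a_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s, a = s_next, a_next                                  # a' is the action actually executed next

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions, epsilon)             # behaviour action
        s_next, r, done = env.step(a)
        best_next = max(Q[s_next][a2] for a2 in actions)       # max over a', not the action taken
        target = r + (0.0 if done else gamma * best_next)
        Q[s][a] += alpha * (target - Q[s][a])                  # update Q first...
        s = s_next                                             # ...the next behaviour action comes from the updated Q
```

Even with epsilon = 0, SARSA commits to a' before the update, whereas Q-learning re-derives its greedy action from the freshly updated Q on the next step, which is the point discussed in the comments below.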

Pablo EM
  • In the pseudocode examples, it is clear that the difference between SARSA and Q-Learning is that the former is on-policy and the latter is off-policy. However, when the policy is purely greedy (epsilon = 0), are they then not the same? The reason I ask is that, when using generalization for the Q-values, it seems appropriate to set epsilon to 0, because the generalization algorithm will make mistakes evaluating the actions in each state. – Mouscellaneous Oct 27 '15 at 15:41
  • 6
    In both implementations show above, with epsilon=0, actions are always choosed based on a policy derived from Q. However, Q-learning first updates Q, and it selects the next action based on the updated Q. In the case of SARSA, it chooses the next action and after updates Q. So, I think that they are not equivalent. – Pablo EM Oct 27 '15 at 16:20
  • In the case of generalization, the fact that the Q-function is approximated with some error does not ensure that every action in every state can be chosen with non-zero probability. Therefore, it is not a good way to introduce exploration. – Pablo EM Oct 27 '15 at 16:26
  • It is true that SARSA chooses an action before updating the Q-values and Q-learning chooses afterwards, but the update applies to the previous state s and not to s', so it shouldn't have any effect on the Q-values of the state s' for which the action a' is being selected, unless generalization is used, in which case the update would affect all states. – Mouscellaneous Oct 27 '15 at 17:02
  • 3
    In a stochastic case, even if you don't use generalization, it can have some effects. For example, if the current state and the next state are the same. But maybe, as you point, in some very specific case both algorithms can learng the same policy. – Pablo EM Oct 27 '15 at 17:14
4

If we use only the greedy policy, then there will be no exploration, so learning will not work. In the limiting case where epsilon goes to 0 (decaying like 1/t, for example), SARSA and Q-Learning both converge to the optimal action values q* and hence to the optimal policy. However, with epsilon fixed, SARSA converges to the optimal epsilon-greedy policy, while Q-Learning converges to q*.
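As a small illustrative sketch of the "epsilon goes to 0" case, the 1/t schedule mentioned above could look like this (the function name and the minimum floor are just choices made for the example):

```python
def epsilon_schedule(t, min_epsilon=0.0):
    """Decaying exploration rate for step/episode t (counted from 1).

    epsilon_t = 1 / t goes to 0 in the limit while still exploring a lot
    early on; this is the kind of decaying schedule under which, as noted
    above, SARSA and Q-learning end up with the same greedy policy.
    """
    return max(min_epsilon, 1.0 / t)

# Exploration probability at a few time steps
for t in (1, 10, 100, 1000):
    print(t, epsilon_schedule(t))
```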

I wrote a small note here that explains the differences between the two; I hope it helps:

https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html

  • I do not think this is correct: for fixed epsilon, Sutton and Barto, Ch. 6, Example 6.6 explicitly shows how, in their chosen example, the Q-learning approach converges to a policy that is worse even than the SARSA policy. http://incompleteideas.net/book/the-book-2nd.html – Quantum Sphinx Oct 07 '21 at 12:28
  • @Quantum Sphinx In that cliff walking example, Q-learning is only worse than SARSA in online performance, due to $\epsilon$. – XXX Oct 25 '21 at 10:27
2

If an optimal policy has already been formed, SARSA with a purely greedy policy and Q-learning are the same.

However, during training we only have a policy or a sub-optimal policy. SARSA with a purely greedy policy will only converge to the "best" sub-optimal policy available, without trying to explore the optimal one, while Q-learning will, because its update uses the term $\max_{a'} Q(s', a')$, which means it evaluates all available actions and takes the maximum.

Alan Yang