146

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (in state s and action a, at timestep t), i.e. Q(s_t, a_t), can be updated as follows

Q(s_t, a_t) = Q(s_t, a_t) + α * (r_t + γ * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

On the other hand, the update step for the Q-learning algorithm is the following

Q(s_t, a_t) = Q(s_t, a_t) + α * (r_t + γ * max_a Q(s_{t+1}, a) - Q(s_t, a_t))

which can also be written as

Q(s_t, a_t) = (1 - α) * Q(s_t, a_t) + α * (r_t + γ * max_a Q(s_{t+1}, a))

where γ (gamma) is the discount factor and r_t is the reward received from the environment at timestep t.

Is the difference between these two algorithms the fact that SARSA only looks up the Q-value of the next action it actually takes, while Q-learning looks up the maximum Q-value over the actions in the next state?

TLDR (and my own answer)

Thanks to all those answering this question since I first asked it. I've made a GitHub repo playing with Q-learning and empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, max or best action depending on how you choose to implement it.

The other main difference is when this selection happens (e.g., online vs offline) and how/why it affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences; a minimal sketch of the two update rules follows below.
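For reference, here is that minimal sketch of the two tabular update rules. This is an illustration only, not the repo above; `epsilon_greedy`, `sarsa_update` and `q_learning_update` are hypothetical helper names, and `Q` is assumed to be a NumPy array indexed by integer state and action.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action a_next that the policy actually chose."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy (max) action, whatever is taken next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

In a full loop, SARSA has to pick a_next with `epsilon_greedy` before its update and then actually execute that action, while Q-learning picks its next action only after the update, independently of the max used inside it.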

One last important note is that both Sutton & Barto as well as Wikipedia often have mixed, confusing or wrong formulaic representations with regard to the next state's best/max action and reward: what is written as r(t+1) is in fact r(t).

Ælex

8 Answers

137

When I was learning this part, I found it very confusing too, so I put together the two pseudo-codes from R. Sutton and A. G. Barto, hoping to make the difference clearer.

[Image: SARSA and Q-learning pseudo-code from Sutton & Barto, shown side by side, with the differing steps highlighted in blue and numbered]

Blue boxes highlight the part where the two algorithms actually differ. Numbers highlight the more detailed difference to be explained later.

TL;DR:

|             | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' |   π   |      π     |
| Updating Q  |   π   |      μ     |

where π is an ε-greedy policy (i.e. ε > 0, with exploration), and μ is a greedy policy (i.e. ε == 0, NO exploration).

  1. Q-learning uses different policies for choosing the next action A' and for updating Q: it generates behaviour with the ε-greedy policy π, but its update target is computed as if following the greedy policy μ. In other words, it evaluates one policy while following another, so it's an off-policy algorithm.

  2. In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.

More detailed explanation:

  1. The most important difference between the two is how Q is updated after each action. SARSA uses the Q-value of A' exactly as drawn from the ε-greedy policy. In contrast, Q-learning uses the maximum Q-value over all possible actions for the next step. This makes it look like following a greedy policy with ε = 0, i.e. NO exploration, in this part.

  2. However, when actually taking an action, Q-learning still uses the action taken from an ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.

  3. Following the loop logic in Q-learning, the action actually executed from S' is still drawn from the ε-greedy policy at the top of the next iteration (see the sketch below).
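To make the highlighted difference concrete in code, here is a hedged sketch of the two loops (illustrative only: `env` is assumed to follow the Gymnasium `reset()`/`step()` convention, and `epsilon_greedy` is a hypothetical helper implementing π). The only structural difference is where the bootstrap value in the update comes from.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    # pi: epsilon-greedy over the current Q estimates
    return int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(np.argmax(Q[s]))

def sarsa_episode(env, Q, alpha, gamma, epsilon, rng):
    """SARSA: choose A' from S' using pi, update with Q(S', A'), then actually take A'."""
    s, _ = env.reset()
    a = epsilon_greedy(Q, s, epsilon, rng)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)         # A' drawn from pi
        bootstrap = 0.0 if terminated else Q[s_next, a_next]     # uses pi's action
        Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
        s, a = s_next, a_next                                    # A' is what gets executed

def q_learning_episode(env, Q, alpha, gamma, epsilon, rng):
    """Q-learning: still act with pi, but update with max_a Q(S', a) (greedy mu)."""
    s, _ = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, epsilon, rng)                   # "Choose A" inside the loop
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        bootstrap = 0.0 if terminated else np.max(Q[s_next])     # uses greedy mu
        Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
        s = s_next
```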

zyxue
  • Congratulations for the beautiful graphics and pics. Years after I asked this question I came to realise that the state and action iteration, and the policy value iteration and update, are two different processes. Sadly, Sutton and Barto don't make this very clear. How you decide on actions affects the algorithms as you explained. Max action in Q-Learning usually implies choosing the action with next best Q(s,a) e.g., greedy. In Sarsa this isn't the case, you either follow the policy (on-line) or you explore a new one depending on a random probability. Your description is spot on! – Ælex Jan 03 '17 at 16:19
  • @SilentCrash, no, it's evaluating π. μ is the greedy policy, just for selecting an action. – zyxue May 06 '18 at 01:00
  • @zyxue But in the table you wrote that it updates Q as if it was following μ (evaluates μ) while actually following ε-greedy policy π. – SilentCrash May 07 '18 at 06:05
  • Can the off-policy method choose A' from human behavior (π) and update Q from a greedy policy (μ)? – Robert Sep 18 '18 at 09:51
  • @Robert, not sure of your question, π is the policy to be learned here, if it's already defined (e.g. whatever human behavior is), then what's to be learned? – zyxue Sep 18 '18 at 21:53
  • @zyxue, I can't get it either. Is Q the target to be learned here? π and μ are given policies? – Robert Sep 19 '18 at 03:30
  • @zyxue, according to David Silver's [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf) (page31). One of the advantages of off-policy control is that it can use human behavior to generate the next action (A_t+1). – Robert Sep 20 '18 at 03:22
  • @Robert, Q-learning is just one of the off-policy learning algorithms. There could be other off-policy algorithms, e.g. ones involving human behaviours. – zyxue Sep 20 '18 at 05:26
  • @Robert, yes, in David Silver's slide, he uses different notations for target policy and behavior policy. – StayFoolish Nov 02 '18 at 03:19
  • Another point I want to make is, although in choosing the next action, both SARSA and Q-learning use an epsilon-greedy policy, if all Q values are the same, they should be choosing the same action if ignoring the random parts in epsilon-greedy. However, the Q values will become more different at some point during learning because the update equation is different for SARSA and Q-learning, thus they might end up choosing different actions even if using the same epsilon-greedy policy improvement strategy. In other words, the iterated policy will become different. – StayFoolish Nov 02 '18 at 03:22
  • So, if Sarsa's policy is greedy, is it equivalent to Q-learning? – Hermes Morales May 10 '21 at 12:51
78

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning does it relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-Learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.

In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and (in expectation over the ε-greedy action choice) a weighted sum of the average action value and the maximum:

Q-Learning (bootstrap term): max_a Q(s_{t+1}, a)

SARSA (expected bootstrap term): E[Q(s_{t+1}, a_{t+1})] = ε · mean_a Q(s_{t+1}, a) + (1 - ε) · max_a Q(s_{t+1}, a)
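As a quick numeric illustration of those two expressions (a sketch with made-up Q-values; the SARSA line is the expectation of its bootstrap term over the ε-greedy choice of a_{t+1}):

```python
import numpy as np

# Made-up Q-values for the actions available in state s_{t+1} (illustration only)
q_next = np.array([1.0, 2.0, 0.5, 3.0])
epsilon = 0.1

# Q-learning always bootstraps from the maximum action value
q_learning_term = q_next.max()

# SARSA's bootstrap term, in expectation over the epsilon-greedy action choice
expected_sarsa_term = epsilon * q_next.mean() + (1 - epsilon) * q_next.max()

print(q_learning_term, expected_sarsa_term)   # 3.0 and roughly 2.8625
```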

Don Reba
  • Ok, so how does Sarsa then choose a policy? I see that Q-learning will always go after the policy that promises its action will take you to the next best policy. What are the criteria for selecting the next policy in Sarsa (basically, what I want to know is how to evaluate, for a policy Q(S,A), how to choose the best action)? Isn't it the same, i.e. choosing for state S the action A which has the highest (i.e. max) Q'(S,A)? – Ælex Jul 28 '11 at 11:40
  • The policy is the rule for selecting the next action. It is something you need to choose when implementing the algorithm. The simplest policy is the greedy one, where the agent always chooses the best action. With this policy, SARSA and Q-Learning are the same. A better choice for learning is the ε-greedy policy, where some of the actions are chosen at random. – Don Reba Jul 28 '11 at 11:59
  • Ok, that is why I asked the question in the first place; in this case they both are the same. Thank you very much! I am using ε-greedy. So Q-learning only differs in the off-policy case, where actions are chosen randomly yet updating with Q-learning maximizes policy values? – Ælex Jul 28 '11 at 12:04
  • Under the ε-greedy policy, the expected value under SARSA is the weighted sum of the average action value and the best action value: Q(s_{t+1}, a_{t+1}) = ε·mean(Q(s,a)) + (1-ε)·max(Q(s,a)). The textbook gives it in chapter 5.4, On-Policy Monte Carlo Control. – Don Reba Jul 28 '11 at 12:40
67

What is the difference mathematically?

As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (St, At):

  • Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action At+1, and then uses Q(St+1, At+1) (discounted by gamma) as expected future returns in the computation of the update target.
  • Q-learning does not use the behaviour policy to select an additional action At+1. Instead, it estimates the expected future returns in the update rule as maxA Q(St+1, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose that I would start following the greedy policy from now on, what would my expected future returns be then?".

What does this mean intuitively?

As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.

In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:

  • Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
  • Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.

When to use which algorithm?

An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.

An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.
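A hands-on way to see this contrast is the cliff-walking gridworld from Sutton & Barto (Example 6.6). Below is a hedged sketch, assuming Gymnasium's `CliffWalking-v0` environment is available; the helper and function names are illustrative. With ε-greedy exploration kept on, SARSA typically learns the longer, safer path away from the cliff, while Q-learning learns the shorter path along the edge and occasionally falls in during training.

```python
import gymnasium as gym
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    return int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(np.argmax(Q[s]))

def train(bootstrap, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Train one tabular agent on CliffWalking-v0; `bootstrap` decides SARSA vs Q-learning."""
    env = gym.make("CliffWalking-v0")
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    returns = []
    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        total, done = 0.0, False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)   # behaviour policy for both
            target = 0.0 if terminated else bootstrap(Q, s_next, a_next)
            Q[s, a] += alpha * (r + gamma * target - Q[s, a])
            s, a, total = s_next, a_next, total + r
        returns.append(total)
    env.close()
    return returns

# SARSA bootstraps from the action actually taken next; Q-learning from the max.
sarsa_returns = train(lambda Q, s2, a2: Q[s2, a2])
q_learning_returns = train(lambda Q, s2, a2: np.max(Q[s2]))

# While exploration is still on, SARSA's online return is usually higher here,
# because Q-learning's values keep it walking right next to the cliff.
print("SARSA    mean return (last 100 episodes):", np.mean(sarsa_returns[-100:]))
print("Q-learn  mean return (last 100 episodes):", np.mean(q_learning_returns[-100:]))
```

If you switch both agents to the greedy policy after training, they typically both reach the goal; the gap printed above is about performance while exploration is still happening.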

Dennis Soemers
  • This is absolutely the best explanation of policy, regardless of algorithms – Ege Jun 13 '20 at 13:46
  • That's a particularly good answer, and should be the accepted one imho – Scrimbibete Oct 10 '21 at 12:23
  • Just to compare apples with apples, assume we designed a robot that is very cheap and we can afford for it to break and fall innumerable times. We can now use Q-learning here instead of Sarsa. The robot keeps falling and breaking its leg because the Q-values don't reflect reality (i.e. the policy we are actually following). Had it been Sarsa, the system would have immediately realized that it is dangerous to walk along the cliff, as Q-values are updated according to the policy being followed. In Q-learning the Q-values are being calculated as per a different policy (one that we may eventually adopt). – Allohvk Sep 20 '22 at 06:25
  • @Allohvk Indeed. From that point of view, Sarsa may have advantages (in terms of learning speed) and be preferable even if we don't care about our agent's performance during the process of learning. Although, in a way, we can think of this also as caring about the agent's performance during learning. Not because it's an expensive robot, but because an agent that performs well during learning may collect more valuable experience and learn more efficiently. – Dennis Soemers Sep 20 '22 at 08:41
  • @DennisSoemers from what I have seen in most of the analysis the metric of interest is cumulative regret. So most ppl actually care about the performance during training right? – Sam Jan 16 '23 at 13:28
  • @Sam I'd say that both situations where we do care about performance during training (directly, or indirectly due to side effects as mentioned in the previous comment), and situations where we do not, can be common. Really wouldn't feel confident saying that either of them is "most cases". Not-caring would probably be more common if we can train "in simulation" (e.g., games, robot simulators, ...), and caring would be more common if training on physical robots, or continuing to train online in a system that is also already deployed and running live (e.g., serving ads, catching fraud, ...). – Dennis Soemers Jan 16 '23 at 15:58
  • @DennisSoemers I see. Could you recommend a paper where the analysis is based on simple regret (as opposed to cumulative regret)? – Sam Jan 17 '23 at 04:36
  • @Sam I'm not extremely familiar with much of the more theoretical work that would typically do analysis on any form of regret. Though, if I simply google "simple regret", there are some results that look relevant. But really in probably the majority of RL papers with empirical experiments, the performance of algorithms is measured on separate "evaluation runs", not on the episodes played for training. Even when they have plots with training time on the x-axis, they typically actually take checkpoints along that axis for separate evaluation runs. – Dennis Soemers Jan 17 '23 at 09:55
5

There's an index mistake in your formula for Q-learning; see page 148 of Sutton and Barto:

Q(s_t, a_t) ← Q(s_t, a_t) + α * [ r_{t+1} + γ * max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

The typo is in the argument of the max: the indices are s_{t+1} and a, while in your question they are s_{t+1} and a_{t+1} (which would be correct for SARSA).

Hope this helps a bit.

Alvin
1

In Q-Learning

This is your Q-learning update: Q(S_t, A_t) = Q(S_t, A_t) + α * [ R_{t+1} + γ * max Q(S_{t+1}, A_t) - Q(S_t, A_t) ]

It should be changed to: Q(S_t, A_t) = Q(S_t, A_t) + α * [ R_{t+1} + γ * max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]

As you said, to compute the update target you find the maximum Q-value by varying a; that gives you the new Q(S_t, A_t). Be careful, though: the a that gives you the maximum Q-value is not the next action. At this stage you only know the next state (S_{t+1}), and before going to the next round you replace S_t with S_{t+1} (S_t ← S_{t+1}). A short sketch of this loop follows after the steps below.

For each loop:

  • choose A_t from S_t using the Q-values

  • take A_t and observe R_{t+1} and S_{t+1}

  • update the Q-value using the equation above

  • S_t ← S_{t+1}

until S_t is terminal.
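A hedged sketch of one pass through that loop (illustrative names; `env.step` is assumed to follow the Gymnasium convention). The point to notice is that the maximising action inside the update is never stored: the action actually executed from S_{t+1} is chosen afresh, ε-greedily, on the next call.

```python
import numpy as np

def q_learning_step(env, Q, s, alpha, gamma, epsilon, rng):
    """One iteration of the loop above; returns (next_state, done)."""
    # Choose A_t from S_t using the Q-values (epsilon-greedy behaviour policy)
    if rng.random() < epsilon:
        a = int(rng.integers(Q.shape[1]))
    else:
        a = int(np.argmax(Q[s]))

    # Take A_t, observe R_{t+1} and S_{t+1}
    s_next, r, terminated, truncated, _ = env.step(a)

    # Update using max_a Q(S_{t+1}, a); this maximising a is only used inside the
    # target and is NOT necessarily the action that will be executed from S_{t+1}
    bootstrap = 0.0 if terminated else np.max(Q[s_next])
    Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])

    # S_t <- S_{t+1}
    return s_next, terminated or truncated
```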

comx
  • Actually, they have confused the audience; it isn't R[t+1] it is R[t], but they do indeed show it as R[t+1] at one point in the book. However (and don't take my word for it, try it yourself) if you set R[t+1] the reward values do not scale between 0 - 1, and even worse you run into algorithm iterations problems, since Q[t] = R[t] when the state is terminal, which will never be true if using R[t+1]. Wikipedia had it wrong (I've edited it) and Sutton and Barto use the two variations in the book without really explaining why. – Ælex Dec 04 '17 at 11:40
1

The only difference between SARSA and Q-learning is that SARSA takes the next action based on the current policy, while Q-learning takes the action with the maximum utility of the next state.

Beyhan Gul
  • This is not true. Both methods take the same exact action (ε-greedy). The difference is (as mentioned in other answers) that they use a different policy to update the Q-function. – mobeets Jan 24 '20 at 02:32
0

I didn't read any book, this is just what I see from their implications: Q-learning just focuses on the (state, action) grid, while SARSA learning focuses on going from state to state, observing the actions taken in s and s', and then updating the grid accordingly.

0

Both SARSA and Q-learning agents follow an ε-greedy policy to interact with the environment.

The SARSA agent updates its Q-function using the next timestep's Q-value with whatever action the policy provides (mostly still the greedy one, but a random action is also accepted). The policy being executed and the policy being updated towards are the same.

The Q-learning agent updates its Q-function using only the action that brings the maximum next-state Q-value (totally greedy with respect to the policy). The policy being executed and the policy being updated towards are different.

Hence, SARSA is on-policy, Q-learning is off-policy.