
In reinforcement learning, I'm trying to understand the difference between policy iteration and value iteration. There are some general answers to this out there, but I have two specific queries which I cannot find an answer to.

1) I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean? I thought that both methods just take each state, then look at all the other states it can reach, and compute its value from this -- either by marginalising over the policy's action distribution (policy iteration) or by taking the argmax with respect to the action values (value iteration); I've sketched what I mean in code below, after query 2. So why is there any notion of the "direction" in which each method "moves"?

2) Policy iteration requires an iterative process during policy evaluation to find the value function of the current policy. However, value iteration just requires one step. Why is this different? Why does value iteration converge in just one step?
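
To make sure we're talking about the same backups, here is roughly what I have in mind, written as code. This is my own toy notation (the containers `P`, `pi`, `V` and `gamma` are just placeholders, not from any particular library), so please correct me if it doesn't match the standard algorithms:

```python
# Toy notation: P[s][a] = [(prob, next_state, reward), ...],
# pi[s][a] = probability of taking action a in state s under the policy,
# V = current value estimates, gamma = discount factor.

def evaluation_backup(s, V, pi, P, gamma):
    # What I understand policy evaluation to do:
    # average the one-step returns over the policy's action distribution.
    return sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])

def value_iteration_backup(s, V, P, gamma):
    # What I understand value iteration to do:
    # take the max over actions instead of averaging over the policy.
    return max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])
```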

Thank you!

Karnivaurus

2 Answers


The answer provided by @Nick Walker is right and quite complete; however, I would like to add a graphical explanation of the difference between Value Iteration and Policy Iteration, which may help to answer the second part of your question.

Both methods, PI and VI, follow the same working principle, based on Generalized Policy Iteration. This basically means that they alternate between improving the policy (which requires knowing its value function) and computing the value function of the new, improved policy.

[Figure: the Generalized Policy Iteration loop, alternating between policy evaluation and policy improvement]

At the end of this iterative process, both the value function and the policy converge to the optimal ones.

However, it has been noticed that it is not necessary to compute the full value function exactly; a single evaluation sweep is enough to allow convergence. In the following figure, (b) represents the operations performed by Policy Iteration, where the full value function is computed, while (d) shows how Value Iteration works.

[Figure: geometric sketch contrasting (b) Policy Iteration, which computes the full value function before each improvement, with (d) Value Iteration, which uses a single evaluation sweep]

Obviously this representation of both methods is simplistic, but it highlights the key difference between the ideas behind each algorithm.
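
If a code sketch helps, here is a rough tabular version of both loops. All names and the container `P[s][a] = [(prob, next_state, reward), ...]` are made up for illustration, every next state is assumed to appear as a key of `P` (terminal states can be modelled as absorbing states with a single zero-reward self-loop), and this is only a sketch, not a reference implementation. Policy Iteration runs the inner evaluation loop to convergence before each improvement, while Value Iteration truncates the evaluation to a single sweep with the max built in:

```python
def q_value(s, a, V, P, gamma):
    # Expected one-step return of taking action a in state s and then following V.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma, theta=1e-8):
    V = {s: 0.0 for s in P}
    pi = {s: next(iter(P[s])) for s in P}            # arbitrary initial policy
    while True:
        # Policy evaluation: sweep until V converges for the *current* policy.
        while True:
            delta = 0.0
            for s in P:
                v_new = q_value(s, pi[s], V, P, gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: one greedy step with respect to the evaluated V.
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q_value(s, a, V, P, gamma))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return V, pi

def value_iteration(P, gamma, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Truncated evaluation: a single backup, with the greedy max folded in.
            v_new = max(q_value(s, a, V, P, gamma) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    pi = {s: max(P[s], key=lambda a: q_value(s, a, V, P, gamma)) for s in P}
    return V, pi
```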

Pablo EM
  • How do you evaluate a given policy and how do you create a value function? The value function tells you, for every state and action, the reward. For this, would you not have to iterate through all states and actions (which would lead to extremely many evaluations)? – PeterBe Jul 27 '21 at 08:31
  • Yes, in the "basic" case we can store the value function V(s), or Q-function Q(s,a), for each state or state-action pair (http://incompleteideas.net/sutton/book/ebook/node41.html) and iterate over all states. This assumes the space of states and actions is small and discrete. In real-world problems, this may not be the case. For such problems, a common solution is to represent the value function approximately. There are many options for approximating the value function, but you can read a good introduction here: http://incompleteideas.net/sutton/book/ebook/node85.html – Pablo EM Jul 27 '21 at 12:03
  • Notice that there exists a second edition of Sutton and Barto's book. You can find the pdf here: http://incompleteideas.net/book/RLbook2020.pdf. However, I'm linking the first edition because its html version allows linking specific sections. – Pablo EM Jul 27 '21 at 12:05
  • Thanks for your comments and links Pablo. I really appreciate it. What I don't understand is the case when you have a problem that gives you a reward only at the end and you don't have any intermediate rewards. How can you build the Q-function in this case? – PeterBe Jul 27 '21 at 12:30
  • Welcome PeterBe! I'm not sure if I have understood your question. But when you only have a reward at the end of the episode, in the first run (i.e., when the agent goes all the way to the end) you will be able to update the state previous to the last step (i.e., the state where you get the reward); let's call it s_{end-1}. In the next run, you will also be able to update the state previous to s_{end-1}, and so on. In some way, the Q-function starts to be filled in "around" the final step, and it is gradually expanded to all states. – Pablo EM Jul 27 '21 at 16:17
  • In Figure 7.12 of this link http://incompleteideas.net/book/first/ebook/node77.html, you can see this effect (only one state value is updated in the middle grid), and how a technique called "eligibility traces" helps to speed up the learning process (right grid). – Pablo EM Jul 27 '21 at 16:17

I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean?

I can't find anything online that describes policy iteration and value iteration in terms of direction, and to my knowledge this is not a common way to explain the difference between them.

One possibility is that someone was referring to the visual impression of values propagating in value iteration. After the first sweep, values are correct on a 1-timestep horizon. Each value correctly tells you what to do to maximize your cumulative reward if you have 1 timestep to live. This means that states that transition to the terminal state and receive a reward have positive values, while almost everything else is 0. With each sweep, the values become correct for a horizon one timestep longer. So the values creep backwards from the terminal state towards the start state as the horizon expands. In policy iteration, instead of just propagating values back one step, you calculate the complete value function for the current policy. Then you improve the policy and repeat. I can't say that this has a forward connotation to it, but it certainly lacks the backwards appearance. You may want to see Pablo's answer to a similar question for another explanation of the differences that may help you contextualize what you have heard.
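
If you want to see that creeping effect directly, here is a tiny made-up example: a 5-state chain with a single "move right" action and a reward of 1 for entering the terminal state (the numbers and layout are mine, purely for illustration). Using synchronous sweeps, the band of non-zero values grows backwards from the terminal state by one state per sweep:

```python
# Toy 5-state chain: states 0..4, state 4 terminal. The only action moves one
# state to the right; entering the terminal state gives reward 1, otherwise 0.
gamma = 0.9
num_states = 5
terminal = num_states - 1

V = [0.0] * num_states
for sweep in range(1, 5):
    V_old = V[:]                               # synchronous sweep: back up last sweep's values
    for s in range(terminal):                  # terminal state keeps value 0
        next_s = s + 1
        reward = 1.0 if next_s == terminal else 0.0
        V[s] = reward + gamma * V_old[next_s]  # only one action, so the max is trivial
    print(f"after sweep {sweep}: {[round(v, 2) for v in V]}")

# after sweep 1: [0.0, 0.0, 0.0, 1.0, 0.0]
# after sweep 2: [0.0, 0.0, 0.9, 1.0, 0.0]
# after sweep 3: [0.0, 0.81, 0.9, 1.0, 0.0]
# after sweep 4: [0.73, 0.81, 0.9, 1.0, 0.0]
```

(With in-place updates and a lucky state ordering, values can travel further in a single sweep, but the textbook picture is the one above.)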

It's also possible that you heard about this forwards-backwards contrast in regard to something related but different: implementations of temporal difference learning algorithms. In this case, the direction refers to the direction in which you look when making an update to state-action values; forwards means you need information about the results of future actions, while backwards means you only need information about things that happened previously. You can read about this in chapter 12 of Reinforcement Learning: An Introduction, 2nd edition.

Why does policy iteration have to do a bunch of value function calculations when value iteration seemingly just does one that ends up being optimal? Why does value iteration converge in just one step?

In policy evaluation, we already have a policy and we're just calculating the value of taking actions as it dictates. It repeatedly looks at each state and moves the state's value towards the values of the states that the policy's action will transition to (until the values stop changing and we consider it converged). This value function is not optimal. It's only useful because we can use it in combination with the policy improvement theorem to improve the policy. The expensive process of extracting a new policy, which requires us to maximize over actions in a state, happens infrequently, and policies seem to converge pretty quickly. So even though the policy evaluation step looks like it would be time consuming, PI is actually pretty fast.

Value iteration is just policy iteration where you do exactly one iteration of policy evaluation and extract a new policy at the same time (maximizing over actions is the implicit policy extraction). Then you repeat this iterate-extract procedure over and over until the values stop changing. The fact that these steps are merged together makes it look more straightforward on paper, but maximizing at each sweep is expensive and means that value iteration is often slower than policy iteration.
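
Here is a sketch of that merged step, to make the "truncated evaluation + implicit extraction" reading explicit. The tabular container `P[s][a] = [(prob, next_state, reward), ...]` and the names are made up for illustration; this is not a reference implementation:

```python
def value_iteration_sweep(V, P, gamma):
    # One value-iteration sweep over a tabular model.
    # V maps state -> current estimate; P[s][a] = [(prob, next_state, reward), ...].
    pi = {}
    for s in P:
        # Back up each action once (the truncated, one-sweep policy evaluation)...
        backups = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                   for a in P[s]}
        # ...then the max over actions is the implicit greedy policy extraction.
        pi[s] = max(backups, key=backups.get)
        V[s] = backups[pi[s]]
    return V, pi
```

Repeating this sweep until `V` stops changing is the whole algorithm.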

Nick Walker