
In reinforcement learning, what is the difference between policy iteration and value iteration?

As far as I understand, in value iteration you use the Bellman equation to solve for the optimal policy, whereas in policy iteration you randomly select a policy π and find the reward of that policy.

My doubt is: if we are selecting a random policy π in PI, how is it guaranteed to reach the optimal policy, even if we choose several random policies?

Blaszard
Arslán

  • It would have been more appropriate to ask this question on websites such as https://ai.stackexchange.com/, https://stats.stackexchange.com or https://datascience.stackexchange.com. – nbro Dec 17 '17 at 00:12
  • I'm voting to close this question because [Machine learning (ML) theory questions are off-topic on Stack Overflow](https://meta.stackoverflow.com/questions/291009/do-pure-machine-learning-questions-belong-to-stack-overflow/291015#291015) - [gift-wrap candidate for Cross-Validated](https://meta.stackoverflow.com/questions/404799/lets-gift-wrap-our-good-machine-learning-theory-questions-for-cross-validated?noredirect=1#comment822113_404799) – Daniel F Feb 10 '21 at 13:54

5 Answers


Let's look at them side by side. The key parts for comparison are highlighted. Figures are from Sutton and Barto's book: Reinforcement Learning: An Introduction.

[Figure: side-by-side pseudocode for policy iteration and value iteration from Sutton & Barto, with the steps that differ highlighted.]

Key points:

  1. Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
  2. Value iteration includes: finding the optimal value function + one policy extraction. There is no repetition of the two, because once the value function is optimal, the policy derived from it should also be optimal (i.e. converged).
  3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
  4. The algorithms for policy evaluation and for finding the optimal value function are highly similar except for a max operation (as highlighted).
  5. Similarly, the key steps of policy improvement and policy extraction are identical, except that the former involves a stability check.

In my experience, policy iteration is faster than value iteration, as a policy converges more quickly than a value function. I remember this is also described in the book.

I guess the confusion mainly came from all these somewhat similar terms, which also confused me before.
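
To make the comparison concrete, here is a minimal NumPy sketch of both algorithms (not the book's pseudocode verbatim), assuming a tabular MDP whose model is given as `P[s][a] = [(prob, next_state, reward), ...]`; that data layout, the function names, and the discount/tolerance values are illustrative assumptions:

```python
import numpy as np

def policy_evaluation(policy, P, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: apply the Bellman expectation backup
    for the fixed input policy until the value function stabilizes."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate full policy evaluation with greedy policy improvement."""
    policy = np.zeros(len(P), dtype=int)            # arbitrary initial policy
    while True:
        V = policy_evaluation(policy, P, gamma)     # evaluate current policy
        stable = True
        for s in range(len(P)):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))                # improvement: the max step
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                  # stability check ends PI
            return policy, V

def value_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Fold the max into every sweep (improvement + truncated evaluation),
    then extract the policy once at the end."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in range(n_actions))      # the max that PI's evaluation lacks
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    policy = np.zeros(len(P), dtype=int)
    for s in range(len(P)):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        policy[s] = int(np.argmax(q))               # one-time policy extraction
    return policy, V
```

The only structural differences are exactly the ones in the list above: `value_iteration` folds the max into every sweep and extracts the policy once, while `policy_iteration` runs a full evaluation, then a greedy improvement, and stops on the stability check.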

nbro
zyxue

  • I agree that policy iteration converges in fewer iterations and I've also read in several places that it is faster. I did some simple box-world and maze solving experiments with both methods in Burlap. I found that value iteration performed more iterations but took less time to reach convergence. YMMV. – Ryan Jan 25 '18 at 01:17
  • @Ryan Have you seen the Grid world? Well, value iteration converges faster than policy iteration there. I am not quite sure about other systems, but I have read in Richard Sutton's book that value iteration is better. Why? Because when doing policy iteration, in the policy evaluation step you have to evaluate the policy until convergence, even though evaluating further after some point is unnecessary. – Cbrom Mar 11 '18 at 12:51
  • @Cbrom, you should have read the opposite. Here is a quote from the book: "*Policy iteration often converges in surprisingly few iterations. This is illustrated by the example in Figure 4.1.*", from page 65 of the [2017nov5](http://incompleteideas.net/book/bookdraft2017nov5.pdf) version of the book. – zyxue Mar 11 '18 at 22:41
  • Yes, I've played with several flavors of Grid world. I was just trying to point out that "faster" in terms of iterations is probably going to favor PI, but "faster" in terms of seconds might actually favor VI. – Ryan Mar 12 '18 at 18:44
  • To clarify, policy iteration will take fewer iterations but is more computationally complex than value iteration; which one is faster depends on the environment. – R.F. Nelson May 10 '18 at 14:02
  • I know that this is an old post, but I highly suggest looking into this (https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978f9e89ddaa). The link provides code and it made things much clearer for me. – tandem Mar 23 '19 at 13:42
  • @zyxue Probably not worth a whole new question... why does the policy iteration algorithm use the policy to select a single action for the value function update in step 2, rather than the expectation over all possible actions (as in the policy evaluation definition a few pages earlier)? Is it just on the basis of assuming deterministic policies at that point in the book? – Doug Brummell May 18 '20 at 08:43

In policy iteration algorithms, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.

In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function. This process is based on the optimality Bellman operator.

In some sense, both algorithms share the same working principle, and they can be seen as two cases of generalized policy iteration. However, the optimality Bellman operator contains a max operator, which is nonlinear and therefore has different properties. In addition, it's possible to use hybrid methods between pure value iteration and pure policy iteration.
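
To make the linear-vs-nonlinear point concrete, here is a small sketch of the two operators, assuming the model is available in matrix form as a reward table `R[s, a]` and per-action transition matrices `T[a]` (those names, shapes, and the discount value are illustrative assumptions, not anything fixed by the answer):

```python
import numpy as np

def evaluate_policy_exactly(policy, R, T, gamma=0.9):
    """The Bellman operator for a fixed policy is affine in V, so the policy's
    value function solves a linear system: v = (I - gamma * P_pi)^-1 r_pi."""
    n = R.shape[0]
    P_pi = np.array([T[policy[s]][s] for s in range(n)])  # row s follows policy[s]
    r_pi = np.array([R[s, policy[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def bellman_optimality_backup(V, R, T, gamma=0.9):
    """The optimality operator takes a max over actions, which is nonlinear,
    so value iteration applies it repeatedly instead of solving in one shot."""
    q = np.stack([R[:, a] + gamma * T[a] @ V for a in range(R.shape[1])], axis=1)
    return q.max(axis=1)
```

Hybrid ("modified policy iteration") schemes sit between the two: instead of the exact solve above, they apply the fixed-policy backup only a few times before improving the policy again.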

nbro
Pablo EM

  • Nice description. Let me add that policy iteration uses the Bellman expectation equation, while value iteration uses the Bellman maximum (optimality) equation. Value iteration can take fewer iterations, but a single iteration can involve a lot of work; policy iteration takes more iterations. – Shamane Siriwardhana Nov 07 '17 at 05:40
  • Isn't there a max operator in policy iteration as well? Otherwise, how do you update the policy based on the new value function? – hzh Apr 07 '18 at 20:04
  • Nope, the SARSA algorithm is a typical example of policy iteration. As you can see in this pseudocode (http://incompleteideas.net/book/ebook/node64.html), the value function update doesn't contain any max operator. However, if you mean a max operator for choosing the best actions from the value function (i.e. greedy actions), then yes, there is a max operation in that process. – Pablo EM Apr 08 '18 at 15:19

The basic difference is:

In policy iteration, you start with a random policy and find the value function corresponding to it, then find a new (improved) policy based on the previous value function, and so on; this leads to the optimal policy.

In value iteration, you start with a random value function, then find a new (improved) value function in an iterative process until reaching the optimal value function, and then derive the optimal policy from that optimal value function.

Policy iteration works on the principle of "policy evaluation → policy improvement".

Value iteration works on the principle of "optimal value function → optimal policy".
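
In equation form (Sutton & Barto's notation, with $p(s', r \mid s, a)$ the dynamics and $\gamma$ the discount factor), the two principles correspond to the Bellman expectation equation and the Bellman optimality equation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]$$

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

Policy iteration repeatedly solves the first equation for the current policy and then improves the policy; value iteration iterates the second equation directly.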

Himanshu Gupta

As far as I am concerned, contrary to @zyxue's idea, VI is generally much faster than PI.

The reason is straightforward: as you already know, the Bellman equation is used to solve the value function for a given policy. Since we can solve for the value function of the optimal policy directly, solving the value function for the current policy is obviously a waste of time.

As for your question about the convergence of PI, I think you might be overlooking the fact that if you improve the strategy at each information state, then you improve the strategy for the whole game. This is also easy to prove if you are familiar with Counterfactual Regret Minimization: the sum of the regrets over the information states forms an upper bound on the overall regret, so minimizing the regret at each state minimizes the overall regret, which leads to the optimal policy.
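
For the MDP setting of the original question, the analogous standard argument is the policy improvement theorem (Sutton & Barto, Section 4.2), roughly:

$$\text{if } q_\pi\bigl(s, \pi'(s)\bigr) \ge v_\pi(s) \text{ for all } s, \quad \text{then } v_{\pi'}(s) \ge v_\pi(s) \text{ for all } s.$$

The greedy policy built from $v_\pi$ satisfies the premise by construction, so every policy iteration step is at least as good as the last; and since a finite MDP has only finitely many deterministic policies, the loop must terminate at an optimal one. That is why the random starting policy does not matter.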

T. Jiang

The main difference in speed is due to the max operation in every iteration of value iteration (VI).

In VI, each state uses just one action (the one with the max utility value) to compute its updated utility value, but it first has to evaluate all possible actions via the Bellman equation in order to find that action.

In policy iteration (PI), this max operation is omitted in step 1 (policy evaluation): the action is chosen by simply following the intermediate policy.

If there are N possible actions, VI has to evaluate the Bellman equation N times for each state and then take the max, whereas PI evaluates it only once per state (for the action given by the current policy).

However, PI's policy improvement step still uses the max operator and is as expensive as a VI sweep, but since PI converges in fewer iterations, this step doesn't happen as often as in VI.
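
As a rough per-sweep cost comparison for a tabular MDP with $|S|$ states, $N = |A|$ actions, and dense transitions (standard textbook asymptotics, stated here only as an approximation):

$$\underbrace{O\bigl(|S|^2 |A|\bigr)}_{\text{one VI sweep}} \qquad \text{vs.} \qquad \underbrace{O\bigl(|S|^2\bigr)}_{\text{one PI evaluation sweep}} \;+\; \underbrace{O\bigl(|S|^2 |A|\bigr)}_{\text{one PI improvement step}}$$

So each VI sweep pays the factor of $|A|$, while PI pays it only in the comparatively rare improvement step; which method wins overall depends on how many sweeps each one needs.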

bcbertyboy