I watched this lecture, and it says that policy iteration is faster than value iteration. The reasons given are:
- Each iteration of value iteration runs in O(S^2 * A), whereas policy iteration's evaluation sweeps run in O(S^2) while computing the values; only the policy-improvement (extraction) step costs O(S^2 * A).
- Usually, value iteration reaches the optimal policy well before the values converge, so the remaining iterations are wasted work. In contrast, policy iteration stops as soon as the policy no longer changes (see the sketch after this list).
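To make the claimed costs concrete, here is a minimal sketch of policy iteration on a toy MDP. The arrays `P`, `R`, the discount `gamma`, and the helper `evaluate` are my own assumptions for illustration, not something from the lecture; the point is only that the inner evaluation sweeps avoid the max over actions, while the improvement step pays the O(S^2 * A) cost once per outer iteration and the loop stops as soon as the policy is stable.

```python
import numpy as np

# Hypothetical toy MDP (assumed, not from the lecture):
# P[s, a, s'] = transition probability, R[s, a] = expected reward, gamma = discount.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalise rows into proper distributions
R = rng.random((S, A))

def evaluate(policy, tol=1e-8):
    """Iterative policy evaluation: each sweep is O(S^2), no max over actions."""
    V = np.zeros(S)
    while True:
        # V(s) <- R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')
        V_new = R[np.arange(S), policy] + gamma * (P[np.arange(S), policy] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

policy = np.zeros(S, dtype=int)
while True:
    V = evaluate(policy)                  # cheap sweeps: O(S^2) each
    Q = R + gamma * (P @ V)               # improvement/extraction: O(S^2 * A)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):  # stop as soon as the policy is stable
        break
    policy = new_policy

print("policy iteration result:", policy)
```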
However, this article says that value iteration is faster than policy iteration. Quoting from the article:
> As we’ve seen, Policy Iteration evaluates a policy and then uses these values to improve that policy. This process is repeated until eventually the optimal policy is reached. As a result, at each iteration prior to the optimal policy, a sub-optimal policy has to be fully evaluated. Consequently, there is potentially a lot of wasted effort when trying to find the optimal policy.
>
> In Policy Iteration, at each step, policy evaluation is run until convergence, then the policy is updated and the process repeats.
>
> In contrast, Value Iteration only does a single iteration of policy evaluation at each step. Then, for each state, it takes the maximum action value to be the estimated state value. Once these state values have converged, to the optimal state values, then the optimal policy can be obtained. In practice this performs much better than Policy Iteration and finds the optimal state value function in much fewer steps.
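For comparison with the policy iteration sketch above, here is a minimal value iteration sketch on the same kind of assumed toy MDP (again, `P`, `R`, and `gamma` are my own illustrative choices, not from the article). Each sweep is a single Bellman optimality backup, so it costs O(S^2 * A) because of the max over actions, and the greedy policy is only extracted once after the values have converged.

```python
import numpy as np

# Hypothetical toy MDP (assumed, not from the article), same shape as before.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

V = np.zeros(S)
while True:
    # One Bellman optimality backup per sweep: O(S^2 * A) due to the max over actions.
    Q = R + gamma * (P @ V)        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Only after the values have converged is the greedy policy extracted.
policy = (R + gamma * (P @ V)).argmax(axis=1)
print("value iteration result:", policy)
```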
I've read a couple of other articles and posts supporting the argument that policy iteration performs better, but I'm not sure which claim is correct.