I am taking a Reinforcement Learning class and I don't understand how to combine the concepts of policy iteration/value iteration with Monte Carlo (and also TD/SARSA/Q-learning). In the table below, how can the empty cells be filled? Should/can it be a binary yes/no, some string description, or is it more complicated?

- How did the homework turn out? – R.F. Nelson May 22 '18 at 18:09
- Thanks for the help! It's not homework; I just put the table together to try and make sense of concepts that are hard to separate. What do you mean by "traditionally value iteration and policy iteration are not considered RL" – so TD and its variants are not applying value/policy iteration? – Johan May 23 '18 at 12:05
- Any update on this question? I actually want to make sense of this too. @Johan – hridayns Jan 08 '20 at 12:04
- I guess the answer is "it's more complicated". The main problem with the table is that the rows show reinforcement learning methods whereas the columns show dynamic programming (optimal planning). Although RL is to a large (but varying) degree based on DP, a direct comparison is not very meaningful: DP is model-based (known transition dynamics) and does not sample the state space, whereas RL is model-free and relies on sampling. See the "RL Course by David Silver" on YouTube (lectures 3-4) for a good explanation. – Johan Jan 08 '20 at 16:23
1 Answer
Value iteration and policy iteration are model-based methods for finding an optimal policy. They operate on the Markov decision process (MDP) of the environment, i.e., on known transition dynamics and rewards. The main premise behind reinforcement learning is that you don't need the MDP of an environment to find an optimal policy, and traditionally value iteration and policy iteration are not considered RL (although understanding them is key to RL concepts). Value iteration and policy iteration learn "indirectly": they rely on a model of the environment and then extract the optimal policy from that model.
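To make "model-based" concrete, here is a minimal sketch of value iteration on a toy two-state MDP. The states, actions, transition table P, reward table R, and gamma below are all made up for illustration; the point is that the whole model is known up front:

```python
import numpy as np

# Toy MDP, fully specified in advance (this is what makes the method model-based):
# P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup until convergence.
V = np.zeros(len(P))
while True:
    V_new = np.array([max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                          for a in P[s])
                      for s in sorted(P)])
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# The optimal policy is then extracted greedily from the converged values.
policy = {s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
          for s in P}
print("V* =", V, "policy =", policy)
```

Note that no interaction with an environment ever happens: everything is computed from P and R.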
"Direct" learning methods do not attempt to construct a model of the environment. They might search for an optimal policy in the policy space or utilize value function-based (a.k.a. "value based") learning methods. Most approaches you'll learn about these days tend to be value function-based.
Within value function-based methods, there are two primary types of reinforcement learning methods:
- Policy iteration-based methods
- Value iteration-based methods
Your homework is asking you, for each of those RL methods, whether it is based on policy iteration or value iteration.
A hint: one of those five RL methods is not like the others.
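To make the two flavors concrete without giving the full answer away, here is a sketch of the two classic tabular update rules side by side. Q is a nested dict of action values, and alpha, gamma, and the argument names are illustrative:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy backup: uses the action a2 the current policy actually takes
    # in s2 -- evaluate the policy, then improve it (policy iteration flavor).
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    # Off-policy backup: uses the best next action regardless of what the
    # agent does next -- a direct Bellman optimality backup (value iteration flavor).
    best_next = max(Q[s2].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Example usage with a nested defaultdict table:
Q = defaultdict(lambda: defaultdict(float))
sarsa_update(Q, s=0, a=1, r=1.0, s2=2, a2=0)
q_learning_update(Q, s=0, a=1, r=1.0, s2=2)
```

Comparing which quantity each rule backs up against the definitions of policy iteration and value iteration is a good way to fill in the table yourself.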
