
I have implemented VI (Value Iteration), PI (Policy Iteration), and Q-learning algorithms in Python. After comparing results, I noticed something: VI and PI converge to the same utilities and policies. With the same parameters, Q-learning converges to different utilities, but to the same policies as VI and PI. Is this normal? I have read a lot of papers and books about MDPs and RL, but couldn't find anything that says whether the utilities VI and PI converge to should match the utilities Q-learning converges to.

The following information describes my grid world and results.

MY GRID WORLD

grid_world.png

  • States => {s0, s1, ... , s10}
  • Actions => {a0, a1, a2, a3} where: a0 = Up, a1 = Right, a2 = Down, a3 = Left for all states
  • There are 4 terminal states, which have +1, +1, -10, +10 rewards.
  • Initial state is s6
  • The transition probability of the intended action is P, and the agent moves to either side of that action with probability (1 - P) / 2 each. (For example: if P = 0.8 and the agent tries to go UP, it goes UP with 80% chance, RIGHT with 10% chance, and LEFT with 10% chance.) A small sketch of this model follows below.
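
To make the model concrete, here is a minimal Python sketch of how that stochastic action outcome can be sampled (the helper name and the use of NumPy's random generator are just for illustration, not my actual implementation):

```python
import numpy as np

# Sketch of the stochastic action model described above:
# the intended direction happens with probability p, and the two
# perpendicular directions each happen with probability (1 - p) / 2.
# Action indices follow the question: 0 = Up, 1 = Right, 2 = Down, 3 = Left.
def sample_direction(intended_action: int, p: float = 0.8,
                     rng: np.random.Generator = np.random.default_rng()) -> int:
    side_a = (intended_action - 1) % 4   # e.g. intended Up -> Left
    side_b = (intended_action + 1) % 4   # e.g. intended Up -> Right
    return int(rng.choice([intended_action, side_a, side_b],
                          p=[p, (1 - p) / 2, (1 - p) / 2]))
```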

RESULTS

  • VI and PI algorithm results with Reward = -0.02, Discount Factor = 0.8, Probability = 0.8
  • VI converges after 50 iterations, PI converges after 3 iterations

vi_pi_results.png

  • Q-learning results with Reward = -0.02, Discount Factor = 0.8, Learning Rate = 0.1, Epsilon (for exploration) = 0.1
  • The utilities shown in the Q-learning results image are the maximum Q(s, a) values of each state (see the short sketch after the image).

qLearning_1million_10million_iterations_results.png
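
For reference, this is how I read those utilities and the greedy policy off the Q-table (illustrative only; the function name is made up for this post):

```python
import numpy as np

def utilities_and_policy(Q: np.ndarray):
    """Q has shape (n_states, n_actions)."""
    utilities = Q.max(axis=1)    # V(s) = max_a Q(s, a), the values shown in the image
    policy = Q.argmax(axis=1)    # greedy policy: pi(s) = argmax_a Q(s, a)
    return utilities, policy
```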

In addition, I noticed that after 1 million Q-learning iterations, states that are equally far from the +10 terminal have the same utilities. The agent does not seem to care whether its path to the reward passes near the -10 terminal, whereas it does care about this in VI and PI. Is this because, in Q-learning, we don't know the transition probabilities of the environment?

yoe1323456

1 Answer


If the state and action spaces are finite, as in your problem, the Q-learning algorithm converges asymptotically to the optimal utility (i.e., the optimal Q-function) as the number of transitions approaches infinity, provided the following conditions hold:

$$\sum_{n=1}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} a_n^2 < \infty,$$

where n indexes the transitions and a_n is the learning rate used at the n-th update. These conditions require the learning rate to decay as learning progresses. A typical choice is a_n = 1/n. However, in practice, the learning rate schedule may require some tuning depending on the problem.
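
To illustrate (this is only a sketch, not the asker's code), a 1/n schedule is usually applied per state-action pair, counting how many times each pair has been updated; the Q and counts tables below are assumed to be plain NumPy arrays of shape (n_states, n_actions):

```python
import numpy as np

def q_update(Q, counts, s, a, r, s_next, done, gamma=0.8):
    """One tabular Q-learning update with a 1/n learning-rate schedule,
    where n is the number of times (s, a) has been updated so far."""
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]                 # satisfies sum a_n = inf, sum a_n^2 < inf
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Here counts is simply initialised to zeros with the same shape as Q.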

On the other hand, another convergence condition is that all state-action pairs be updated infinitely often (in an asymptotic sense). This can be achieved simply by maintaining an exploration rate greater than zero.
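
For example, a fixed ε-greedy rule keeps the exploration rate above zero so every state-action pair keeps being visited (again just a sketch, with an assumed tabular Q):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    # With probability epsilon take a random action, otherwise the greedy one.
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))
```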

So, in your case, you need to decrease the learning rate as training progresses.

Pablo EM
  • My Q-learning algorithm converges; I have no problem with that. The question I am asking is: will the converged utility values of the states in Value Iteration be the same as the Q-function utility values when Q-learning converges? – yoe1323456 Dec 29 '17 at 14:55
  • Yes, I understand your question. My answer is telling you that theoretically both methods converge to the same 'optimal' utility function. But for that to happen, you need to ensure the above-mentioned conditions. Try it and let me know the results. Currently, your Q-learning is converging, but not to the optimal utility function (even though you can compute an optimal policy from the computed utility). – Pablo EM Dec 29 '17 at 15:09
  • I tried it: a decaying learning rate 1/n, where n is the number of visits to each state. It didn't change the result... Then I realized what I was doing wrong. This environment is stochastic, so I changed my Q-learning implementation to account for the stochasticity. Everything seems fine now, except for one strange problem. When I set the learning rate to 1.0, Q-learning finds the optimal utilities and policies within 1k iterations, while it can't even converge with the decaying learning rate 1/n after 100k iterations. I thought a decaying learning rate should converge faster, but it does not. Do you have any opinion? – yoe1323456 Dec 29 '17 at 22:01
  • Ok, nice to see the theory was right! :) Well, the maximum learning rate value is 1, so if the Q-learning algorithm can converge using that value, you likely couldn't achieve faster convergence with another learning rate scheme. However, in most (non-trivial) cases, you should choose a learning rate lower than 1. In such situations, a more elaborate choice will probably improve convergence. Anyway, 1/n is far from being a good choice, even if it is theoretically suitable. – Pablo EM Dec 30 '17 at 22:54
  • In the book Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd Edition, chapter 11 is dedicated almost exclusively to learning rate schedules. It may be too dense a reference, but right now I don't remember any alternatives. – Pablo EM Dec 30 '17 at 22:57
  • Thank you for your guidance. After I finish sufficient research on it, I will accept your answer and update my question with the correct results. – yoe1323456 Dec 31 '17 at 01:09