Dynamic Programming of Markov Decision Process with Value Iteration

Question

I am learning about MDPs and value iteration through self-study, and I hope someone can improve my understanding.

Consider a 3-sided die with the numbers 1, 2, 3. If you roll a 1 or a 2 you win that amount in dollars, but if you roll a 3 you lose all your money and the game ends (finite horizon problem).

Conceptually I understand how this is done with the following formula:

So let's break that down:

Since this is a finite horizon problem we can ignore gamma.

If I observe 1, I can either go or stop. The utility/value of that is:

V(1) = max(Q(1, g), Q(1, s))
Q(1, g) = r + SUM( P( 2 | 1,g) * V(2) + P( 3 | 1,g) * V(3))
Q(1, s) = r + SUM( P( 2 | 1,s) * V(2) + P( 3 | 1,s) * V(3))

where r = 1

If I observe 2, I can either go or stop:

V(2) = max(Q(2, g), Q(2, s))
Q(2, g) = r + SUM( P( 1 | 2,g) * V(1) + P( 3 | 2,g) * V(3))
Q(2, s) = r + SUM( P( 1 | 2,s) * V(1) + P( 3 | 2,s) * V(3))

where r = 2

I observe 3, the game ends.

Intuitively V(3) is 0 because the game is over, so we can remove that half from the equation of Q(1, g). We defined V(2) above also so we can substitute that as:

Q(1, g) = r + SUM( P( 2 | 1,g) *     
    MAX ((P( 1 | 2,g) * V(1)) , (P( 1 | 2,s) * V(1))))

This is where things take a bad turn. I am not sure how to solve Q(1, g) when it appears in its own definition. This is likely due to my poor math background.

What I do understand is that the utilities or the values of the states will change based on the reward and therefore the decision will change.

Specifically if rolling three gave you $3 while rolling one ended the game, that will affect your decision because the utility has changed.

But I am not sure how to write code to calculate that.

Can someone explain how Dynamic Programming works in this? How do I solve Q(1,g) or Q(1,s) when it is in its own definition?

If you are only interested in computing the state value `V*(s)` you don't need to use action-state value functions `Q*(s,a)`. On the other hand, finding the optimal value function in a given MDP typically can not be solved analytically. Dynamic programming works by applying an iterative procedure that converges to the solution. — Pablo EM, Aug 27 '17 at 10:03

score 3 · Answer 1 · answered Aug 26 '17 at 18:22

Special solution:

For your example, it is fairly easy to decide whether "go" or "stop" should be chosen: there is an amount of money X at which it makes no difference whether you "go" or "stop". For all smaller amounts you should "go", and for all larger amounts you should "stop". So the only question is what this amount is:

X = E("stop"|X) = E("go"|X) = 1/3(1+X) + 1/3(2+X) =>
1/3X = 1 =>
X = 3

Already in the first line, I used the fact that even if I choose "go" and win, I will choose "stop" in the next round. Knowing which decision should be made, it is easy to calculate the expected win under the optimal strategy, here in Python:

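def calc(money):
    """Expected final payout when playing 'go' below 3$ and 'stop' otherwise."""
    PROB = 1.0 / 3.0  # probability of each face of the die
    if money < 3:  # go: a 1 or a 2 is added to the money, a 3 leaves 0
        return PROB * calc(money + 1) + PROB * calc(money + 2) + PROB * 0
    else:  # stop: keep the current money
        return money

print("Expected win: %.11f" % calc(0))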

>>> Expected win: 1.37037037037
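The threshold X = 3 derived above can be checked numerically. The following is only a small sketch: the helper name `go_once` is made up for illustration, and it assumes (as the derivation does) that you stop right after the extra roll. Exact fractions are used to avoid rounding noise at the indifference point.

```python
from fractions import Fraction

def go_once(x):
    """Expected money after one more roll from x dollars, then stopping."""
    # Rolling a 1 or a 2 adds to x; rolling a 3 leaves 0 and contributes nothing.
    return Fraction(1 + x, 3) + Fraction(2 + x, 3)

for x in range(6):
    diff = go_once(x) - x
    choice = "go" if diff > 0 else ("indifferent" if diff == 0 else "stop")
    print(x, go_once(x), choice)
```

For amounts below 3 rolling once more is better than stopping, at exactly 3 both are equal, and above 3 stopping is better.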

General solution:

I'm not sure the above course of action can be generalized for arbitrary scenarios. However, there is another possibility to solve such problems.

Let's change the game a little bit: instead of infinitely many turns, at most N turns are possible. Then your recursion becomes:

E(money, N)=max(money, 1/3*E(money+1, N-1)+1/3*E(money+2, N-1)), with E(money, 0)=money

As you can easily see, the value E(money, N) no longer depends on itself but on the results of a game with a smaller number of turns.

Without proof, I claim that the value you are looking for is E(money)=lim_{N->infinity} E(money, N).

For your specific problem the Python code would look as follows:
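As a quick illustration of this recursion, here is a direct memoized transcription. It is a sketch under stated assumptions, not the table-based code of this answer: the base case `E(money, 0) = money` (no rolls left means you keep what you have) is assumed, and `fractions.Fraction` is used so the values are exact.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def E(money, n):
    """Best expected payout with `money` dollars and at most n rolls left."""
    if n == 0:
        return Fraction(money)  # assumed base case: no rolls left, forced stop
    # "go": a 1 or a 2 moves to a richer state with one roll fewer; a 3 pays 0.
    go = (E(money + 1, n - 1) + E(money + 2, n - 1)) / 3
    return max(Fraction(money), go)  # "stop" keeps the current money

for n in range(6):
    print(n, E(0, n))
```

Each call only refers to games with fewer rolls left, so nothing is defined in terms of itself. For this game `E(0, n)` reaches 37/27 (about 1.37037) at n = 3 and does not change for larger n.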

PROB = 1.0 / 3.0

MAX_GOS = 20  # neglect all possibilities with more than 20 "go" decisions

LENGTH = 2 * MAX_GOS + 1  # each go can add at most 2$

# What is the expected value if the game ended now?
expected = list(range(LENGTH))

for gos_left in range(1, MAX_GOS + 1):
    updated = [0] * len(expected)
    # entries past this range would need values beyond the table, so skip them
    for money in range(LENGTH - gos_left * 2):
        # decision: stop (keep the money) or go (a 3 contributes 0)
        updated[money] = max(expected[money],
                             PROB * expected[money + 1] + PROB * expected[money + 2])
    expected = updated

print("Expected win: %.11f" % expected[0])

>>> Expected win: 1.37037037037

I'm glad both methods yielded the same result!
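For completeness, the same idea can be written as a value iteration loop that sweeps over the states until the values stop changing, which is how an equation that refers to itself is usually solved. This is only a sketch under assumptions that are not part of the question: the state is the money currently held (not the last roll), the function name `value_iteration` and the parameters `gains`, `faces`, `cap` and `tol` are made up for illustration, and above `cap` dollars the player is simply forced to stop.

```python
def value_iteration(gains=(1, 2), faces=3, cap=20, tol=1e-12):
    """Repeat V[m] = max(m, sum(V[m + g]) / faces) until nothing changes."""
    size = cap + max(gains) + 1
    # V[m] estimates the value of holding m dollars; start from "stop now".
    V = [float(m) for m in range(size)]
    while True:
        delta = 0.0
        for m in range(cap):  # states at or above cap keep V[m] = m (forced stop)
            # Each winning face has probability 1/faces; losing faces pay 0.
            go = sum(V[m + g] for g in gains) / faces
            new_value = max(float(m), go)
            delta = max(delta, abs(new_value - V[m]))
            V[m] = new_value
        if delta < tol:  # converged: a full sweep changed nothing noticeable
            return V

V = value_iteration()
print("Expected win: %.11f" % V[0])
```

Calling `value_iteration(gains=(2, 3))` models the variant from the question in which rolling a 2 or a 3 pays out and rolling a 1 ends the game.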
