I am learning about MDP
's and value iteration
in self-study and I hope someone can improve my understanding.
Consider the problem of a 3 sided dice having numbers 1, 2, 3
. If you roll a 1
or a 2
you get that value in $
but if you roll a 3
you loose all your money and the game ends (finite horizon problem
)
Conceptually I understand how this done with the following forumla:
So let's break that down:
Since this is a finite horizon
problem we can ignore gamma
.
If I observe 1
, I can either go
or stop
. The utility/value
of that is:
V(1) = max(Q(1, g), Q(1, s))
Q(1, g) = r + SUM( P( 2 | 1,g) * V(2) + P( 3 | 1,g) * V(3))
Q(1, s) = r + SUM( P( 2 | 1,s) * V(2) + P( 3 | 1,s) * V(3))
where r = 1
I observe 2
, I can either go
or stop
:
V(2) = max(Q(2, g), Q(2, s))
Q(2, g) = r + SUM( P( 1 | 2,g) * V(1) + P( 3 | 1,g) * V(3))
Q(2, s) = r + SUM( P( 1 | 2,s) * V(1) + P( 3 | 1,s) * V(3))
where r = 2
I observe 3, the game ends.
Intuitively V(3)
is 0
because the game is over, so we can remove that half from the equation of Q(1, g)
. We defined V(2)
above also so we can substitute that as:
Q(1, g) = r + SUM( P( 2 | 1,g) *
MAX ((P( 1 | 2,g) * V(1)) , (P( 1 | 2,s) * V(1))))
This where things take a bad turn. I am not sure how to solve Q(1, g)
if it has its own definition in its solution. This likely due to poor math background.
What I do understand is that the utilities or the values of the states will change based on the reward and therefore the decision will change.
Specifically if rolling three gave you $3
while rolling one ended the game, that will affect your decision because the utility has changed.
But I am not sure how to write code to calculate that.
Can someone explain how Dynamic Programming works in this? How do I solve Q(1,g)
or Q(1,s)
when it is in its own definition?