I'm trying to find the optimal policy for the Markov Decision Process shown in this diagram, using Value Iteration (via pymdptoolbox) and NumPy, but pymdptoolbox complains that my transition matrix "is not stochastic".
Is that because of the all-zero rows like [0, 0, 0, 0]? Some transitions are impossible (for example, from state 1 to state 3), so how do I represent these impossible transitions, if not with zeroes?
My code:
import mdptoolbox
import numpy as np
# transitions[a][s][s2] = probability of moving from state s to state s2 under action a
transitions = np.array([
    # action 1
    [
        [0.2, 0.8, 0, 0],   # s1
        [0, 0, 0, 0],       # s2 (action 1 impossible here)
        [0, 0, 0, 0],       # s3 (action 1 impossible here)
        [0, 0, 0.9, 0.1]    # s4
    ],
    # action 2
    [
        [0.2, 0, 0, 0.8],   # s1
        [0, 0.2, 0.8, 0],   # s2
        [0, 0, 0, 0],       # s3 (action 2 impossible here)
        [0, 0, 0, 0]        # s4 (action 2 impossible here)
    ],
    # action 3
    [
        [0, 0, 0, 0],       # s1 (action 3 impossible here)
        [0.8, 0.2, 0, 0],   # s2
        [0, 0, 0, 1],       # s3
        [0, 0, 0, 0]        # s4 (action 3 impossible here)
    ],
    # action 4
    [
        [0.8, 0, 0, 0.2],   # s1
        [0, 0, 0, 0],       # s2 (action 4 impossible here)
        [0, 1, 0, 0],       # s3
        [0, 0, 0, 0]        # s4 (action 4 impossible here)
    ]
])
# rewards[s][a] = reward for taking action a in state s
# (reward of 1 for any action taken in s3, 0 everywhere else)
rewards = np.array([
    [0, 0, 0, 0],   # s1
    [0, 0, 0, 0],   # s2
    [1, 1, 1, 1],   # s3
    [0, 0, 0, 0]    # s4
])
vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, 0.4)  # discount 0.4; this line raises the "is not stochastic" error
vi.run()
print(vi.policy)
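From what I understand, "stochastic" here means every row of every transitions[a] must sum to 1, and the all-zero rows obviously don't. This quick NumPy check (just a diagnostic, not part of my program; row_sums is a name I made up) confirms it's exactly those rows:

row_sums = transitions.sum(axis=2)   # shape (A, S): sum over destination states
print(np.argwhere(row_sums == 0))    # the (action, state) pairs whose rows are all zeros

It prints exactly the rows I filled with [0, 0, 0, 0]. One idea I had (I'm not sure it's the standard convention) is to give each impossible state-action pair a self-transition of probability 1, e.g. replacing the s2 row under action 1 with [0, 1, 0, 0], so that taking an unavailable action just leaves the state unchanged. Is that the right way to encode it, or does pymdptoolbox have a proper mechanism for unavailable actions?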