I am trying out the PyBrains maze example
my setup is:
envmatrix = [[...]]
env = Maze(envmatrix, (1, 8))
task = MDPMazeTask(env)
table = ActionValueTable(states_nr, actions_nr)
table.initialize(0.)
learner = Q()
agent = LearningAgent(table, learner)
experiment = Experiment(task, agent)
for i in range(1000):
experiment.doInteractions(N)
agent.learn()
agent.reset()
Now, I am not confident in the results that I am getting
The bottom-right corner (1, 8) is the absorbing state
I have put an additional punishment state (1, 7) in mdp.py:
def getReward(self):
""" compute and return the current reward (i.e. corresponding to the last action performed) """
if self.env.goal == self.env.perseus:
self.env.reset()
reward = 1
elif self.env.perseus == (1,7):
reward = -1000
else:
reward = 0
return reward
Now, I do not understand how, after 1000 runs and 200 interaction during every run, agent thinks that my punishment state is a good state (you can see the square is white)
I would like to see the values for every state and policy after the final run. How do I do that? I have found that this line table.params.reshape(81,4).max(1).reshape(9,9)
returns some values, but I am not sure whether those correspond to values of the value function