PyBrain Q-learning maze example: state values and the global policy

Question

I am trying out the PyBrain maze example.

My setup is:

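from pybrain.rl.environments.mazes import Maze, MDPMazeTask
from pybrain.rl.learners.valuebased import ActionValueTable
from pybrain.rl.learners import Q
from pybrain.rl.agents import LearningAgent
from pybrain.rl.experiments import Experiment

envmatrix = [[...]]
env = Maze(envmatrix, (1, 8))  # second argument is the goal position
task = MDPMazeTask(env)
# states_nr and actions_nr are 81 and 4 for the 9x9 maze discussed below
table = ActionValueTable(states_nr, actions_nr)
table.initialize(0.)
learner = Q()
agent = LearningAgent(table, learner)
experiment = Experiment(task, agent)
for i in range(1000):
    experiment.doInteractions(N)  # N interactions per run (200 here)
    agent.learn()
    agent.reset()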

Now, I am not confident in the results that I am getting.

The bottom-right corner (1, 8) is the absorbing state.

I have added a punishment state (1, 7) in mdp.py:

def getReward(self):
    """ compute and return the current reward (i.e. corresponding to the last action performed) """
    if self.env.goal == self.env.perseus:
        self.env.reset()
        reward = 1
    elif self.env.perseus == (1,7):
        reward = -1000
    else:
        reward = 0
    return reward

What I do not understand is how, after 1000 runs of 200 interactions each, the agent still thinks that my punishment state is a good state (you can see the square is white).

I would like to see the value of every state and the policy after the final run. How do I do that? I found that the line table.params.reshape(81,4).max(1).reshape(9,9) returns some values, but I am not sure whether they correspond to the values of the value function.

Boris Mocialov · Answer 1 · 2015-11-30T20:59:20.280

I have now added another constraint: the agent always starts from the same position, (1, 1), which I did by adding self.initPos = [(1, 1)] in maze.py. After 1000 runs of 200 interactions each I get this behaviour:

This makes more sense: the agent tries to go around the wall from the other side, avoiding the state (1, 7).

So I was getting weird results because the agent used to start from random positions, which included the punishing state.
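A toy version of the tabular Q-learning update may help show how a punishing state can still end up looking good. The sketch below is only an illustration, not PyBrain's code: the three states, the two actions, and the ALPHA and GAMMA values are made up for the example. It assumes, as in the getReward shown in the question, that the reward is computed after the move, so the -1000 is credited to the state and action that led into the punishing cell, while the cell itself is valued by its best outgoing action.

```python
# Illustrative tabular Q-learning update; ALPHA, GAMMA and the toy states are assumptions.
ALPHA, GAMMA = 0.5, 0.99

def q_update(q, state, action, reward, next_state):
    """Apply one Q-learning update to q[state][action] in place."""
    best_next = max(q[next_state])
    q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])

# Three toy states with two actions each: 0 = neighbour, 1 = punishing state, 2 = goal.
q = [[0.0, 0.0] for _ in range(3)]

# Stepping from the neighbour into the punishing state: the -1000 is credited
# to the (neighbour, action 0) pair, because the reward follows the action.
q_update(q, state=0, action=0, reward=-1000, next_state=1)

# Stepping from the punishing state into the goal earns +1, credited to
# the (punishing state, action 0) pair.
q_update(q, state=1, action=0, reward=1, next_state=2)

# State value as the maximum over that state's action values.
state_values = [max(row) for row in q]
print(q[0][0], state_values)
```

In this toy run the -500 sits on the action leading into the punishing state, while the punishing state itself gets a positive value from its step into the goal. A plot of max-over-actions values would therefore still show that cell as attractive whenever the agent has had the chance to act from it.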

EDIT:

Another point: if you do want to spawn the agent at a random position, make sure it is not spawned in the punishing state:

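def _freePos(self):
    """ produce a list of the free positions, skipping any punishing states. """
    # punishing_states is a custom attribute added to Maze; None means nothing is excluded
    excluded = self.punishing_states if self.punishing_states is not None else ()
    res = []
    for i, row in enumerate(self.mazeTable):
        for j, p in enumerate(row):
            # a falsy entry in mazeTable marks a free (non-wall) cell
            if not p and (i, j) not in excluded:
                res.append((i, j))
    return res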

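It also seems that table.params.reshape(81,4).max(1).reshape(9,9) does return the value of every state: for each of the 81 states it takes the largest of the 4 action values, max_a Q(s, a), and arranges the results on the 9x9 grid.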
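The policy asked about in the question can be read off the same table by taking the index of the best action instead of its value. With NumPy that is the same expression with argmax(1) in place of max(1). The sketch below spells the idea out in plain Python on a stand-in list rather than a real table.params. The layout (4 consecutive action values per state, state index = row * 9 + column) and the two non-zero entries are assumptions for illustration. Which direction each action index means depends on the action list in maze.py and should be checked there.

```python
# Stand-in for table.params: a flat list of 81 * 4 Q-values (assumed layout:
# the 4 action values of state 0, then those of state 1, and so on).
N_ROWS, N_COLS, N_ACTIONS = 9, 9, 4
params = [0.0] * (N_ROWS * N_COLS * N_ACTIONS)

# Hypothetical entries so the output is not all zeros: state 10 is cell (1, 1).
params[10 * N_ACTIONS + 1] = 0.7   # action 1 in cell (1, 1)
params[10 * N_ACTIONS + 3] = 0.2   # action 3 in cell (1, 1)

def greedy_values_and_policy(flat_q):
    """Return (values, policy) grids: max and argmax over each state's actions."""
    values, policy = [], []
    for r in range(N_ROWS):
        value_row, policy_row = [], []
        for c in range(N_COLS):
            start = (r * N_COLS + c) * N_ACTIONS
            q_sa = flat_q[start:start + N_ACTIONS]
            # index of the largest action value; ties go to the lowest index
            best = max(range(N_ACTIONS), key=q_sa.__getitem__)
            value_row.append(q_sa[best])
            policy_row.append(best)
        values.append(value_row)
        policy.append(policy_row)
    return values, policy

values, policy = greedy_values_and_policy(params)
print(values[1][1], policy[1][1])
```

Cells that were never visited keep their initial values, so their entry in the policy grid is just the first action index and carries no information.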
