Problems with coding Markov Decision Process

Question

I am trying to code Markov-Decision Process (MDP) and I face with some problem. Could you please check my code and find why it isn't works

I have tried to do make it with some small data and it works and give me necessary results, which I feel is correct. But my problem is with generalising of this code. Yeah, I know about MDP library, but I need to code this one. This code works and I want same result in class:

import pandas as pd
data = [['3 0', 'UP', 0.6, '3 1', 5, 'YES'], ['3 0', 'UP', 0.4, '3 2', -10, 'YES'], \
    ['3 0', 'RIGHT', 1, '3 3', 10, 'YES'], ['3 1', 'RIGHT', 1, '3 3', 4, 'NO'], \
    ['3 2', 'DOWN', 0.6, '3 3', 3, 'NO'], ['3 2', 'DOWN', 0.4, '3 1', 5, 'NO'], \
    ['3 3', 'RIGHT', 1, 'EXIT', 7, 'NO'], ['EXIT', 'NO', 1, 'EXIT', 0, 'NO']]

df = pd.DataFrame(data, columns = ['Start', 'Action', 'Probability', 'End', 'Reward', 'Policy'], \
                  dtype = float) #initial matrix

point_3_0, point_3_1, point_3_2, point_3_3, point_EXIT = 0, 0, 0, 0, 0

gamma = 0.9 #it is a discount factor

for i in range(100): 
    point_3_0 = gamma * max(0.6 * (point_3_1 + 5) + 0.4 * (point_3_2 - 10), point_3_3 + 10)
    point_3_1 = gamma * (point_3_3 + 4)
    point_3_2 = gamma * (0.6 * (point_3_3 + 3) + 0.4 * (point_3_1 + 5))
    point_3_3 = gamma * (point_EXIT + 7)


print(point_3_0, point_3_1, point_3_2, point_3_3, point_EXIT)

But here I have a mistake somewhere and it look like too complex? Could you please help me with this issue?!

gamma = 0.9

class MDP:

    def __init__(self, gamma, table):
        self.gamma = gamma
        self.table = table

    def Action(self, state):
        return self.table[self.table.Start == state].Action.values

    def Probability(self, state):
        return self.table[self.table.Start == state].Probability.values

    def End(self, state):
        return self.table[self.table.Start == state].End.values

    def Reward(self, state):
        return self.table[self.table.Start == state].Reward.values

    def Policy(self, state):
        return self.table[self.table.Start == state].Policy.values

mdp = MDP(gamma = gamma, table = df)

def value_iteration():
    states = mdp.table.Start.values
    actions = mdp.Action
    probabilities = mdp.Probability
    ends = mdp.End
    rewards = mdp.Reward
    policies = mdp.Policy

    V1 = {s: 0 for s in states}
    for i in range(100):
        V = V1.copy()
        for s in states:
            if policies(s) == 'YES':
                V1[s] = gamma * max(rewards(s) + [sum([p * V[s1] for (p, s1) \
                in zip(probabilities(s), ends(s))][actions(s)==a]) for a in set(actions(s))])
            else: 
                sum(probabilities[s] * ends(s))

    return V

value_iteration()

I expect values in every point, but get: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Could you attach the full exception trace so we would know which line provoked the exception? I'm betting on `if policies(s) == 'YES'`, you can replace it with `if (policies(s) == 'YES').any()` or `if (policies(s) == 'YES').all()` depending on your need, this happens because `policies(s)` is an array (of booleans) — Rotem Tal, Jun 23 '19 at 18:28
I added: if (policies(s) == 'YES').any(). But here I get another error - TypeError: only integer scalar arrays can be converted to a scalar index. There is a problem in 15th row: ---> 15 in zip(probabilities(s), ends(s))][actions(s)==a]) for a in set(actions(s))]) ---> 16 else: ---> 17 sum(probabilities[s] * ends(s)) TypeError: only integer scalar arrays can be converted to a scalar index. Sorry, I can't add full error log due to characters amount limit — David, Jun 23 '19 at 18:42

score 0 · Answer 1 · answered Jun 23 '19 at 18:41

0

You get the error, because policies(s) = ['YES' 'YES' 'YES'], so it contains 'YES' three times. If you want to check, if all elements in policies(s) are 'YES', simply replace policies(s) == 'YES' with all(x=='YES' for x in policies(s))

If you only want to check for the first element, change to policies(s)[0] == 'YES'

See the Post check if all elements in a list are identical for different approaches.

answered Jun 23 '19 at 18:41

TheJed

5
1
7

Yeah, I see it. So, I want to find a max exactly for those rows where Policy == "YES". For other rows a want just multiply Probability by Value. So, you can see in first code. So, where Policy == "YES" I want find maximum of values, where Policy != "YES", just multiplication – David Jun 23 '19 at 18:47
So, I tried to drop if statement, but the problem stays anyway. – David Jun 23 '19 at 18:48

score 0 · Answer 2 · answered Jun 23 '19 at 19:37

0

For the second problem described (assuming (policies(s) == YES).any() fixed the 1st problem) notice that you initialize a regular python list with this expresion

[sum([p * V[s1] for (p, s1) in zip(probabilities(s), ends(s))]

which you then try to access with the indices [actions(s)==a] python lists don't support multiple indexing, and this cause the TypeError you encountered

answered Jun 23 '19 at 19:37

Rotem Tal

739
5
11

I understand what you mean, so how do you think it is worth solving this problem? – David Jun 23 '19 at 19:50
I'm not 100% sure what you try to achieve, I think you could change inside the zip to this `zip(probabilities(s)[actions(s)==a], ends(s)[actions(s)==a])` if that was your intention, otherwise you could make clear what was the purpose of those indices – Rotem Tal Jun 23 '19 at 20:01
I just want him to go through this loop depending on the number of unique "actions". For example, for the first "State" = 3 0. There is three "actions": "UP, UP, RIGHT" (according to the table). So, using "for a in set(actions(s))" it gives me a set of TWO unique values i.e. "UP, RIGHT" – David Jun 23 '19 at 20:20
Ok, I think I understand, for any action you want to sum over all probabilities and values, you need to reduce `probabilities(s)` and `ends(s)` to contain only values corresponding to some action, the problem is, since you return them as np arrays (when calling .values on the data frame) you lose that information, that is, which probability correspond to which action, if you would return them as pandas Series (just remove the .values in the class methods) you could keep that information, then my previous fix would work (2nd comment) – Rotem Tal Jun 23 '19 at 21:13

Problems with coding Markov Decision Process

2 Answers2