So for my master's thesis I'm trying to code the variable elimination algorithm, in this case applied to multi agent MDPs. I'm guiding myself from this example to help me go through the algorithm while I code:
I don't have any problems understanding the algorithm itself, but I'm really stuck in figuring out how to structure this in terms of code, so I came here for help.
For starters, the way I'm representing the individual Q functions. For example, Q1 is the individual Q function of agent 1, where agent 2 also has influence, so I'm representing it as a matrix where the lines are the actions of agent 1, and the columns the actions from agent 2, and each value represents the reward for each pair of actions:
[[0.59852164, 0.39597276, 0.7602051 , 0.13295899],
[0.41479068, 0.03774037, 0.44269462, 0.68948537],
[0.14632375, 0.83137715, 0.79722578, 0.50048093],
[0.03223895, 0.89804766, 0.83578682, 0.95018204]]
Moving on, and here's where I start to have doubts on how to represent this, and where I would start strongly welcoming some suggestions on how to improve. How to represent the individual Q functions plus what agents have influence.
The way I'm doing is with a tuple where the first element is the matrix I talked about previously, the second element is the function's agent (so for Q1 it would be agent 1) and the third and last element the other agent that has influence on the function:
([[0.59852164, 0.39597276, 0.7602051 , 0.13295899],
[0.41479068, 0.03774037, 0.44269462, 0.68948537],
[0.14632375, 0.83137715, 0.79722578, 0.50048093],
[0.03223895, 0.89804766, 0.83578682, 0.95018204]], 'a1', 'a2')
The argument for the function will be an array of these kinds of tuples:
elimination([(a12, "a1", "a2"), (a24, "a2", "a4"), (a13, "a3", "a1"), (a34, "a4", "a3")])
a12 is the matrix I showed before, for example.
So, as you can see, it's already pretty complicated, but I can't find a better way to represent this any better.
So now we move on to the variable elimination part, the way I'm trying to do this is by creating the functions as I go. For example, in the example's first step, it eliminates a4 by creating the function e4(a2,a3) that returns the optimal action for agent 4 taking into account the actions from agents 2 and 3, since those are the ones that show up in the same Q functions as agent 4 (Q2 and Q4)
So what I would want in this case is a matrix representing what action agent 4 should choose taking into account agent's 2 (the lines) and agent's 3 (the columns) action:
[[0, 1, 0, 2],
[1, 2, 0, 3],
[1, 3, 0, 0],
[0, 2, 2, 0]]
For example, the value at [0,1] is 1, because when agent 2 picks action 0 and agent 3 picks action 1, the action agent 4 needs to pick to achieve the best value, is action 1. Of course we reach those values by going through the Q2 and Q4 functions and seeing what action of agent 4 is the best for the combinations of the other actions.
Since 4 is the first agent we eliminate, I need to find the Q functions which contain agent 4, that's pretty easy, since each matrix is accompanied by the strings that tell us which agents influence it:
def elimination(Q):
qual = Q
qual1 = []
qual2= []
e4 = []
e4final = []
e4args= []
e3 = []
e2 = []
e1 = []
#Agent 4 elimination
for q in Q:
if q[1] == "a4" or q[2] == "a4":
e4.append(q)
else:
qual1.append(q)
I go through the array and find the Q function with agent 4, and append them to the array e4 which I will use to create the function later, e4 looks like this:
[(array([[0.59852164, 0.39597276, 0.7602051 , 0.13295899],
[0.41479068, 0.03774037, 0.44269462, 0.68948537],
[0.14632375, 0.83137715, 0.79722578, 0.50048093],
[0.03223895, 0.89804766, 0.83578682, 0.95018204]]), 'a2', 'a4'),
(array([[0.67738865, 0.59291644, 0.60924388, 0.04292563],
[0.14199367, 0.50405062, 0.62515779, 0.59868234],
[0.4568713 , 0.97801569, 0.95928574, 0.67878902],
[0.5427607 , 0.29061713, 0.7378397 , 0.37298815]]), 'a4', 'a3')]
I put the other Q functions on the array qual1 to use it later in the algorithm.
Talking about e4 again, since now we have some concrete values, let's for example try to figure out what action agent 4 has to choose when both agents 2 and 3 choose action 0. For the first matrix we need to take into account the first line:
[0.59852164, 0.39597276, 0.7602051 , 0.13295899]
And for the second matrix, the first column:
[0.67738865, 0.14199367, 0.4568713, 0.5427607]
And figure out what is the largest sum of values of the same index (the agent 4 action), in this case, is 0 (0.59852164 + 0.67738865), so agent 4 picks 0 when agent 2 and 3 pick 0.
The goal is to do this for every combination of agent 2 and agent 3 actions.
I believe I can figure out how to get e4 but after that is where I get stuck, I feel like the problem is the way I structured this from the start, but I don't have more ideas. This is also the very start of the algorithm, I also need to worry about making it flexible to work for different kinds of connections between agents, although probably I won't need to worry about having a Q function with 3 agents influencing it. I also need to figure out to incorporate the functions into the variable elimination, since in the case of agent 4 I'm only working with the Q functions, but as you can see from the example, the function for agent 3 will contain function e4.
Any kind of help is appreciated, maybe there's a simpler way to represent this that makes the whole thing easier, but I'm really having a hard time figuring that out. Thanks in advance.