I am learning how to use reinforcement learning for optimization. I have chosen the problem of maximum matching in a bipartite graph because I can easily compute the true optimum.
Recall that a matching in a graph is a subset of the edges where no two edges are incident on the same node/vertex. The goal is to find the largest such subset.
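As a tiny concrete example, in the bipartite graph with parts {0, 1} and {2, 3} and edges (0,2), (0,3), (1,3), the edge set {(0,2), (1,3)} is a matching of size 2, and no larger one exists. This can be checked directly with igraph (a standalone sketch, separate from my actual code below):
import igraph as ig

# Toy graph: vertices 0 and 1 form one part (type 0), vertices 2 and 3 the other (type 1)
toy = ig.Graph.Bipartite([0, 0, 1, 1], [(0, 2), (0, 3), (1, 3)])
m = toy.maximum_bipartite_matching()
# m.matching[i] is the partner of vertex i, or -1 if vertex i is unmatched,
# so each matched pair is counted twice; divide by 2 to get the matching size.
print(sum(1 for v in m.matching if v != -1) // 2)  # prints 2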
I show my full code below but first let me explain parts of it.
num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
This makes a random bipartite graph with 1000 nodes in each of the two parts; with p=3/num_variables each node has on average three incident edges, so the graph has roughly 3000 edges in total. It then prints the size of the true maximum matching (each entry of g_matching.matching is the index of a vertex's partner, or -1 if it is unmatched, so the filter counts each matched pair exactly once).
In the code below, self.agent_pos is an array representing the current matching found. Its length is the number of edges in the original graph, and there is a 1 at index i if edge i is included and a 0 otherwise. self.matching is the set of edges in the growing matching, and self.matching_nodes is the set of nodes covered by the growing matching, used to check whether a particular edge can be added.
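To make that bookkeeping concrete, adding a single edge in isolation looks like this (a standalone sketch of the same check-and-update that appears inside step() below):
agent_pos = [0] * len(g.es)
matching, matching_nodes = set(), set()

i = 0                      # index of some edge of g
edge = g.es[i]
if edge.source not in matching_nodes and edge.target not in matching_nodes:
    # neither endpoint is covered yet, so edge i can join the matching
    matching.add(edge)
    matching_nodes.add(edge.source)
    matching_nodes.add(edge.target)
    agent_pos[i] = 1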
import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
class MaxMatchEnv(gym.Env):
    metadata = {'render.modes': ['console']}

    def __init__(self, array_length=10):
        super(MaxMatchEnv, self).__init__()
        # One action per edge of the graph; the observation is the 0/1 vector agent_pos
        self.array_length = array_length
        self.agent_pos = [0]*array_length
        self.action_space = spaces.Discrete(array_length)
        self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
        self.matching = set()        # set of edges in the current matching
        self.matching_nodes = set()  # set of node ids (ints) covered by the matching
        self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
        self.best_found = 0

    def reset(self):
        # Start each episode from the empty matching
        self.time = 0
        #print(self.agent_pos)
        self.agent_pos = [0]*self.array_length
        self.matching = set()
        self.matching_nodes = set()
        return np.array(self.agent_pos)

    def step(self, action):
        self.time += 1
        reward = 0
        edge = g.es[action]
        if not(edge.source in self.matching_nodes or edge.target in self.matching_nodes):
            # Neither endpoint is covered yet, so the edge can be added
            self.matching.add(edge)
            self.matching_nodes.add(edge.source)
            self.matching_nodes.add(edge.target)
            self.agent_pos[action] = 1
            if sum(self.agent_pos) > self.best_found:
                self.best_found = sum(self.agent_pos)
                print("New max", self.best_found)
            reward = 1
        elif self.agent_pos[action] == 1:
            # The chosen edge is already in the matching: remove it
            #print("Removing edge", action)
            self.matching_nodes.remove(edge.source)
            self.matching_nodes.remove(edge.target)
            self.matching.remove(edge)
            self.agent_pos[action] = 0
            reward = -1
        done = sum(self.agent_pos) == self.matching_size
        info = {}
        return np.array(self.agent_pos), reward, done, info

    def render(self, mode='console'):
        print(sum(self.agent_pos))

    def close(self):
        pass
if __name__ == '__main__':
    num_variables = 1000
    g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
    g_matching = g.maximum_bipartite_matching()
    print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

    env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)
    model = PPO('MlpPolicy', env, verbose=1).learn(10000000)
There are a number of problems with this, but the main one is that it doesn't optimize well. The code gets to a matching of just over 550 and then stops improving, whereas the true optimum is over 900 (it is printed out by the code at the start).
The main question is:
How can this be done better so that it gets to a better matching?
A subsidiary question is, how can I print the best matching found so far? My attempt using self.best_found to maintain the best score does not work as it seems to be reset regularly.
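Ideally I would like to be able to do something like the following after training, i.e. recover and print an actual matching rather than just its size (a rough sketch, assuming the model and graph defined above and that the policy has learned something useful):
# Rough sketch: run one deterministic episode in a fresh environment and
# print the matching that the trained policy builds.
eval_env = MaxMatchEnv(array_length=len(g.es))
obs = eval_env.reset()
for _ in range(len(g.es)):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(int(action))
    if done:
        break
print("Edges in the matching:", [i for i, x in enumerate(obs) if x == 1])
print("Matching size:", int(sum(obs)))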
Changes that haven't helped
- Swapping PPO for DQN makes only a marginal difference.
- I tried changing the code so that done is True after 1000 steps.
The change is as follows:
if self.time == 1000:
    done = True
else:
    done = False
Having added print(max(env.get_attr("best_found"))) in place of print("New max", self.best_found), this change to done shows no advantage at all.
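For concreteness, one place such a check can live outside the environment is a minimal stable-baselines3 callback along these lines (only an illustrative sketch; the BestFoundCallback name and the reporting interval are mine):
from stable_baselines3.common.callbacks import BaseCallback

class BestFoundCallback(BaseCallback):
    def _on_step(self) -> bool:
        # Every 10000 training steps, report the best score any sub-env has seen
        if self.n_calls % 10000 == 0:
            print("Best so far:", max(self.training_env.get_attr("best_found")))
        return True

model = PPO('MlpPolicy', env, verbose=1).learn(10000000, callback=BestFoundCallback())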