
I am learning about how to optimize using reinforcement learning. I have chosen the problem of maximum matching in a bipartite graph as I can easily compute the true optimum.

Recall that a matching in a graph is a subset of the edges where no two edges are incident on the same node/vertex. The goal is to find the largest such subset.
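
For concreteness, here is a tiny stand-alone check, not part of my code, that a list of edges given as vertex pairs is a valid matching:

def is_matching(edge_list):
    # A set of edges is a matching iff no vertex appears in more than one edge
    seen = set()
    for u, v in edge_list:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

print(is_matching([(0, 2), (1, 3)]))  # True
print(is_matching([(0, 2), (2, 3)]))  # False: vertex 2 appears twice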

I show my full code below but first let me explain parts of it.

num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

This makes a random bipartite graph with 1000 nodes in each of the two parts, each possible edge being present independently with probability 3/num_variables (so the expected degree of a node is 3). It then prints the size of a true maximum matching.

In the code below, self.agent_pos is an array representing the matching found so far. Its length is the number of edges in the original graph, with a 1 at index i if edge i is included and a 0 otherwise. self.matching is the set of edges in the growing matching, and self.matching_nodes is the set of nodes covered by it, which is used to check whether a particular edge can still be added.
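
For example (purely illustrative; env here stands for an instance of the class below and g for the graph above), the matching encoded by self.agent_pos can be turned back into vertex pairs like this:

chosen_edges = [g.es[i] for i, used in enumerate(env.agent_pos) if used == 1]
chosen_pairs = [(e.source, e.target) for e in chosen_edges]  # list of (left vertex, right vertex) pairs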

import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

class MaxMatchEnv(gym.Env):
    metadata = {'render.modes': ['console']}
    def __init__(self, array_length=10):
        super(MaxMatchEnv, self).__init__()
        # Number of edges in the graph; one action (and one observation entry) per edge
        self.array_length = array_length
        self.agent_pos = [0]*array_length
        self.action_space = spaces.Discrete(array_length)
        self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
        self.matching = set()  # set of edges
        self.matching_nodes = set() # set of node ids (ints)
        self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
        self.best_found = 0
        
    def reset(self):
        # Reset to the empty matching
        self.time = 0
        self.agent_pos = [0]*self.array_length
        self.matching = set()
        self.matching_nodes = set()
        return np.array(self.agent_pos)
    
        
    def step(self, action):
        self.time += 1 
        reward = 0
        edge = g.es[action]
        # Add the edge only if neither endpoint is already covered by the matching
        if not (edge.source in self.matching_nodes or edge.target in self.matching_nodes):
            self.matching.add(edge)
            self.matching_nodes.add(edge.source)
            self.matching_nodes.add(edge.target)
            self.agent_pos[action] = 1
            if sum(self.agent_pos) > self.best_found:
                self.best_found = sum(self.agent_pos)
                print("New max", self.best_found)
            reward = 1
        elif self.agent_pos[action] == 1:
            # The edge is already in the matching: remove it (toggle it off)
            self.matching_nodes.remove(edge.source)
            self.matching_nodes.remove(edge.target)
            self.matching.remove(edge)
            self.agent_pos[action] = 0
            reward = -1
        # The episode only ends once a matching of the known maximum size is found
        done = sum(self.agent_pos) == self.matching_size
        info = {}
        return np.array(self.agent_pos), reward, done, info

    def render(self, mode='console'):
        print(sum(self.agent_pos))

    def close(self):
        pass


if __name__ == '__main__':
 
    num_variables = 1000
    g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
    g_matching = g.maximum_bipartite_matching()
    print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

    env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)

    model = PPO('MlpPolicy', env, verbose=1).learn(10000000)

There are a number of problems with this, but the main one is that it doesn't optimize well. This code gets to just over 550 and then stops improving, whereas the true optimum is over 900 (it is printed out by the code at the start).

The main question is:

How can this be done better so that it gets to a better matching?

A subsidiary question is: how can I print the best matching found so far? My attempt at using self.best_found to maintain the best score does not work, as it seems to get reset regularly.

Changes that haven't helped

  • Changing PPO for DQN makes only a marginal difference.
  • I tried changing the code so that done is True after 1000 steps.

The change is as follows:

if self.time == 1000:
    done = True
else:
    done = False

Having added `print(max(env.get_attr("best_found")))` in place of `print("New max", self.best_found)`, this change to done shows no advantage at all.

byteful
  • Reinforcement learning does not always converge towards an optimal model; sometimes a different seed can give tremendously different models. Have you also considered a policy other than MlpPolicy? – Jean Bouvattier May 02 '22 at 12:26
  • Besides, maybe you could find better help on https://ai.stackexchange.com/ – Jean Bouvattier May 02 '22 at 12:28
  • @JeanBouvattier Happy to try anything else. What do you suggest? I guess I can't post on ai.se until this bounty is complete now? I had assumed my code was just buggy to be honest. – byteful May 02 '22 at 13:06
  • your code does converge to a solution, albeit not an optimal one, so I don't think there is a technical bug. Anyway, stable-baselines3 offers multiple RL algorithms: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html and maybe another one is better suited for your example – Jean Bouvattier May 02 '22 at 13:15
  • @JeanBouvattier I tried DQN just now. It doesn't seem any better (gets to 572 I think where the true opt is 934 with random seed 7). How can I change my code to print the highest value found so far? – byteful May 02 '22 at 13:19

2 Answers


To print the max for each environment you can use the get_attr method from Stable Baselines. More info in their official docs.

For example the lines below will print the max for each of the 12 environments and then the maximum across all environments.

print(env.get_attr("best_found"))
print(max(env.get_attr("best_found")))

Regarding why it does not converge, it could be due to a badly chosen reward, although looking at your reward choice it seems sensible. I added a debug print to your code to see whether any step leads to done = True, but it seems that the environment never reaches that state. I think that for the model to learn it would help to have multiple sequences of actions leading to a state with done = True, which would mean the model gets to experience the end of an episode. I did not study the problem in your code in detail, but maybe this information can help you debug the issue.

If we compare the problem with other environments like CartPole, there we have episodes that end with done = True, and that helps the model learn a better policy (in your case you could limit the number of actions per episode instead of running the same episode forever; see the sketch below). That could help the model avoid getting stuck in a local optimum, as you give it the opportunity to "try again" in a new episode.
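
For illustration, one way to cap the episode length without touching the class itself is gym's TimeLimit wrapper. This is only a sketch, and the 1000-step limit is an arbitrary choice, not a tuned value:

from gym.wrappers import TimeLimit

env = make_vec_env(
    lambda: TimeLimit(MaxMatchEnv(array_length=len(g.es)), max_episode_steps=1000),
    n_envs=12)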

Ignacio
  • Thank you for `print(max(env.get_attr("best_found")))`. I tried changing the code with `if self.time == 1000: done = True else: done = False` but it didn't help. Did you find something that worked? – byteful May 03 '22 at 08:26
  • In fact the results with if self.time == 1000: done = True are very odd. `ep_rew_mean` **decreases** over time. Any idea why? – byteful May 03 '22 at 11:31
  • It could be because you are setting it to done but not resetting the simulation? If you do not reset the simulation then it is harder for the model to get new rewards. You could also try increasing the time parameter to see if more time is needed for the model to get a higher reward per episode. – Ignacio May 15 '22 at 12:55
  • I answered also to the other question you posted, hopefully that helps – Ignacio May 15 '22 at 13:00
  • Sadly your suggestion of adding self.reset() seems to make no difference. Did you try it yourself? – byteful May 18 '22 at 09:08
  • I tried it with a lower value of `num_variables`. One thing to take into account is that the action space increases a lot with higher values of `num_variables`; I'm not sure if that explains why it gets stuck around 550. – Ignacio May 27 '22 at 21:57

If I'm reading this right, then you aren't performing any hyperparameter tuning. How do you know your learning rate / policy update rate / any other variable parameters are a good fit for the problem at hand?

It looks like stable baselines has this functionality built-in: https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune#hyperparameter-optimization
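
For example, PPO's main hyperparameters can be set directly on the constructor. This is only a sketch; the particular values below are illustrative, not tuned for this problem:

model = PPO(
    'MlpPolicy',
    env,
    learning_rate=1e-4,   # default is 3e-4
    n_steps=4096,         # rollout length per environment between updates
    batch_size=256,
    ent_coef=0.01,        # entropy bonus to encourage more exploration
    gamma=0.99,
    verbose=1,
)
model.learn(10_000_000)

The linked RL Zoo page shows how to automate this kind of search.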

Mandias
  • Have you found any hyperparameters that help for OP’s code? – Simd May 07 '22 at 03:54
  • I will happily award the bounty to anyone who can get much better results by changing hyperparameters for this problem. – Simd May 07 '22 at 08:11