I am just getting started self-studying reinforcement-learning with stable-baselines 3. My long-term goal is to train an agent to play a specific turn-based boardgame. Currently I am quite overwhelmed with new stuff, though.
I have implemented a gym-environment that I can use to play my game manually or by having it pick random actions.
Currently I am stuck with trying to get a model to hand me actions in response to an observation. The action-space of my environment is a DiscreteSpace(256)
. I create the model with the environment as model = PPO('MlpPolicy', env, verbose=1)
. When I later call model.predict(observation)
I do get back a number that looks like an action. When run repeatedly I get different numbers, which I assume is to be expected on an untrained model.
Unfortunately in my game most of the actions are illegal in most states and I would like to filter them and pick the best legal one. Or simply dump the output result for all the actions out to get an insight on what's happening.
In browsing other peoples code I have seen references to model.action_probability(observation)
. Unfortunately method is not part of stable baselines 3 as far as I can tell. The guide for migration from stable baselines 2 to v3 only mentions it not being implemented [1].
Can you give me a hint on how to go on?