
We have a custom reinforcement learning environment in which we run a PPO agent from Stable Baselines3 for a multi-action selection problem. The agent learns as expected, but when we evaluate the learned policy of the trained agents, the agents achieve worse results (around 50% lower rewards) with deterministic=True than with deterministic=False. The goal of the study is to find new policies for a real-world problem, so a deterministic policy would be desirable because it is much easier for most people to understand... And it seems counterintuitive that more random actions result in better performance.

The documentation only says "deterministic (bool) – Whether or not to return deterministic actions.". I understand this as follows: deterministic=False means the actions are drawn from the learned distribution with a certain stochasticity (i.e. one specific state can result in several different actions), while deterministic=True means the most probable action under the learned policy is always chosen (i.e. one specific state always results in one specific action).
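
For reference, the comparison boils down to something like the sketch below (CartPole-v1 only stands in for our custom environment, and the timestep and episode counts are placeholders):

    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    # CartPole-v1 is just a stand-in for the custom environment described above.
    model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
    model.learn(total_timesteps=50_000)

    # Same trained model, only the action-sampling mode differs.
    eval_env = model.get_env()
    mean_det, std_det = evaluate_policy(model, eval_env, n_eval_episodes=100, deterministic=True)
    mean_sto, std_sto = evaluate_policy(model, eval_env, n_eval_episodes=100, deterministic=False)
    print(f"deterministic=True : {mean_det:.1f} +/- {std_det:.1f}")
    print(f"deterministic=False: {mean_sto:.1f} +/- {std_sto:.1f}")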

The question is: what does it say about the agent and/or the environment when the performance is better with deterministic=False than with deterministic=True?

GHE
  • Facing the same behavior with PPO. Would be great if you can share more about your experience in a new answer. I have tried many ways to make my policy more robust so that it does not rely on stochasticity to solve the problem but no success so far (I am documenting my journey here: https://medium.com/@manubotija/list/my-trip-into-reinforcement-learning-d6c244d5aa29) – manubot Jan 12 '23 at 22:26

1 Answer


You need to be very careful before making stochastic agents deterministic. This is because they can become unable to achieve certain goals. Consider the following over-simplified example with 8 states:

|   | # |   | # |   |
| X |---| G |---| X |

'G' is the goal, 'X' is a pit, '---' is a wall. The '#' states are impossible to handle with a deterministic policy because they look identical to the agent, so whatever single action the policy assigns to '#' is taken in both cells. For instance, if the policy at '#' is left, then from the two states in the top left the agent will never get to the goal. The strength of stochastic policies is that they can prevent this kind of trap and still let the agent find a way to the goal.
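
To make this concrete, here is a tiny self-contained simulation of the corridor above (an illustration written for this answer, not code from any library). The top-row cells are columns 0..4, columns 1 and 3 are the '#' cells, and reaching column 2 counts as success because from there the agent simply steps down into G:

    import random

    def run_episode(hash_policy, start, max_steps=50):
        """hash_policy() returns 'left' or 'right'; it is only consulted in the '#' cells."""
        col = start
        for _ in range(max_steps):
            if col == 2:
                return True  # above the goal, step down into G
            if col in (1, 3):
                # The two '#' cells are indistinguishable, so they share one decision.
                action = hash_policy()
            else:
                # Corner cells are distinguishable; the sensible action there is
                # simply to move back toward the centre.
                action = "right" if col == 0 else "left"
            col += 1 if action == "right" else -1
        return False  # oscillated between a corner and a '#' cell, never reached G

    policies = {
        "deterministic (always left at '#')": lambda: "left",
        "stochastic (50/50 at '#')": lambda: random.choice(["left", "right"]),
    }
    for name, policy in policies.items():
        wins = sum(run_episode(policy, start) for start in (0, 1, 3, 4) for _ in range(100))
        print(f"{name}: {wins}/400 episodes reach the goal")

The deterministic variant only succeeds from the right half of the corridor (200 of 400 episodes here), while the 50/50 policy reaches the goal from every start state.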

Additionally, the stochasticity of the actions should reduce over time to reflect growing certainty that a particular action is correct, but of course there can be some states (such as '#' above) where significant uncertainty remains.
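
If you want to check whether such high-uncertainty states remain in a trained Stable Baselines3 model, you can query the learned action distribution directly. The sketch below assumes a reasonably recent SB3 version (where ActorCriticPolicy exposes obs_to_tensor and get_distribution) and again uses CartPole-v1 as a stand-in environment:

    import torch as th
    from stable_baselines3 import PPO

    # A quickly-trained model as a stand-in for the real agent.
    model = PPO("MlpPolicy", "CartPole-v1", verbose=0).learn(total_timesteps=20_000)

    obs = model.get_env().reset()
    with th.no_grad():
        obs_tensor, _ = model.policy.obs_to_tensor(obs)
        dist = model.policy.get_distribution(obs_tensor)
        # High entropy for a given observation means the policy is still unsure
        # which action is best in that state.
        print("entropy of the action distribution:", dist.entropy().item())

    # Both prediction modes for the same observation.
    action_det, _ = model.predict(obs, deterministic=True)   # mode of the distribution
    action_sto, _ = model.predict(obs, deterministic=False)  # a sample from it
    print("deterministic action:", action_det, "| sampled action:", action_sto)

If the states where the deterministic and stochastic evaluations diverge also show high entropy, that is a good sign the policy genuinely relies on the randomness there.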

Andy
  • I do not understand your argument and example, because a policy that always goes left on '#' is a bad policy, and if it relies on stochasticity to go right then it means it is still exploring and needs training. I still do not get why a trained policy would need to be stochastic – manubot Jan 15 '23 at 21:20
  • Well, it is important to understand. The point is that, in this case, the agent can never be trained to find the goal from every state with a fixed policy _where states are so similar they are indistinguishable but require different actions_. If all states are easy to distinguish then, of course, it can learn the correct action for each individual state. – Andy Jan 16 '23 at 22:38