I am trying to apply the PPO algorithm from the stable baselines3 library https://stable-baselines3.readthedocs.io/en/master/ to a custom environment I made.
One thing I don't understand is the following line:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
Should I always set deterministic to True? When I keep deterministic=True, my custom environment is "somehow" always solved (i.e., it always returns a reward of 1 with a std of 0).
And when I change it to deterministic=False, it starts behaving in a reasonable way (i.e., sometimes it succeeds (reward = 1) and sometimes it fails (reward = 0)).
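From what I gather, deterministic controls whether the policy picks the single most likely action or samples from its action distribution. Here is a minimal pure-Python sketch of that difference (illustrative only, not SB3's actual internals; the probabilities and function name are made up):

```python
import random

# Hypothetical action probabilities a policy might output for one state.
action_probs = [0.7, 0.2, 0.1]

def select_action(probs, deterministic):
    if deterministic:
        # Deterministic: always take the highest-probability action (argmax).
        return max(range(len(probs)), key=lambda a: probs[a])
    # Stochastic: sample an action index weighted by the probabilities.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# With deterministic=True, the same state always yields the same action,
# so an evaluation run gives identical episodes (hence std = 0 for me).
print(select_action(action_probs, deterministic=True))

# With deterministic=False, different actions can be chosen across episodes,
# which would explain why my rewards then vary between 0 and 1.
print(select_action(action_probs, deterministic=False))
```

If that mental model is right, it would explain why my evaluation has zero variance with deterministic=True: every episode replays the exact same action sequence.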