I am trying to apply the PPO algorithm from the stable baselines3 library https://stable-baselines3.readthedocs.io/en/master/ to a custom environment I made.
One thing I don't understand is the following line:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
Should I always set deterministic to True? When I keep deterministic=True, my custom environment is "somehow" always solved (i.e., it always returns a reward of 1 with a std of 0).
And when I change it to deterministic=False, it starts behaving in a reasonable way (i.e., sometimes it succeeds (reward = 1) and sometimes it fails (reward = 0)).
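From what I gather, deterministic controls whether the policy picks the single most likely action or samples from its action distribution. Here is a minimal pure-Python sketch of that difference (illustrative only, not SB3's actual internals; the probabilities and function name are made up):

```python
import random

# Hypothetical action probabilities a policy might output for one state.
action_probs = [0.7, 0.2, 0.1]

def select_action(probs, deterministic):
    if deterministic:
        # Deterministic: always take the highest-probability action (argmax).
        return max(range(len(probs)), key=lambda a: probs[a])
    # Stochastic: sample an action index weighted by the probabilities.
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# With deterministic=True, the same state always yields the same action,
# so an evaluation run gives identical episodes (hence std = 0 for me).
print(select_action(action_probs, deterministic=True))

# With deterministic=False, different actions can be chosen across episodes,
# which would explain why my rewards then vary between 0 and 1.
print(select_action(action_probs, deterministic=False))
```

If that mental model is right, it would explain why my evaluation has zero variance with deterministic=True: every episode replays the exact same action sequence.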