
I wrote a custom gym environment and trained it with the PPO implementation provided by stable-baselines3. The ep_rew_mean recorded by TensorBoard is as follows:

[Figure: ep_rew_mean curve over 100 million total training steps; each episode has 50 steps]

As shown in the figure, the reward is around 15.5 after training and the model converges. However, when I run evaluate_policy() on the trained model, the reward is much smaller than the ep_rew_mean value. The first value is the mean reward, the second is the standard deviation of the reward:

4.349947246664763 1.1806464511030819

The way I call evaluate_policy() is:

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)

As far as I understand, the initial state is randomly distributed within an area by the reset() function, so there should not be an overfitting problem.
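For context, the kind of reset() randomization described above could look like the following minimal sketch (RandomStartEnv, its spaces, and its reward are hypothetical stand-ins using the classic gym API, not the author's actual environment):

    import numpy as np
    import gym
    from gym import spaces

    class RandomStartEnv(gym.Env):
        """Hypothetical env: each episode starts at a random point in a square area."""

        def __init__(self, max_steps=50):
            super().__init__()
            self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
            self.action_space = spaces.Box(low=-0.1, high=0.1, shape=(2,), dtype=np.float32)
            self.max_steps = max_steps

        def reset(self):
            # Initial state drawn uniformly from the area, so every episode starts somewhere new.
            self.state = np.random.uniform(-1.0, 1.0, size=2).astype(np.float32)
            self.steps = 0
            return self.state

        def step(self, action):
            self.state = np.clip(self.state + action, -1.0, 1.0).astype(np.float32)
            self.steps += 1
            reward = float(-np.linalg.norm(self.state))   # placeholder reward
            done = self.steps >= self.max_steps           # fixed 50-step episodes, as in the question
            return self.state, reward, done, {}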

I have also tried different learning rates and other hyperparameters, but the problem persists.

I have checked my environment, and I think there is no error.

I have searched the internet, read the stable-baselines3 documentation, and looked through GitHub issues, but did not find a solution.

Aramiis
  • Was your env wrapped with a Monitor or any other rescaling wrappers during training? SB3 often does this in the background before training, while `evaluate_policy` takes unscaled values from `env.step` (see the sketch after these comments). – gehirndienst Feb 06 '23 at 08:58
  • Thanks for your reply. My env is not wrapped with a Monitor; I didn't notice this and will check it later. So does `evaluate_policy` get the true reward value produced by the model? – Aramiis Feb 06 '23 at 10:16
  • I have wrapped my env with a Monitor and retrained the model, and did not notice the reward being rescaled. Wrapping with a Monitor before calling `evaluate_policy` doesn't change the reward either. My env has a fixed number of steps per episode, so I guess the Monitor is not the problem. – Aramiis Feb 07 '23 at 11:15
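For reference, a minimal sketch of the Monitor-wrapping suggested in the comments, reusing the hypothetical RandomStartEnv from above (the training length is shortened; none of this is the author's actual code):

    from stable_baselines3 import PPO
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.evaluation import evaluate_policy

    # Wrap the same env class with Monitor for training and evaluation, so the
    # episode statistics behind ep_rew_mean and the rewards summed up by
    # evaluate_policy come from the same, unscaled env.step() values.
    train_env = Monitor(RandomStartEnv())   # hypothetical env from the sketch above
    eval_env = Monitor(RandomStartEnv())

    model = PPO("MlpPolicy", train_env, verbose=1)
    model.learn(total_timesteps=100_000)    # shortened; the question trained for 100 million steps

    mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=100)
    print(mean_reward, std_reward)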

1 Answer


evaluate_policy has deterministic set to True by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).

If you sample from the action distribution during training, it may help to evaluate the policy without it selecting actions via an argmax (by passing deterministic=False).
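For illustration, continuing the hypothetical setup sketched above (model and eval_env are the placeholders from that sketch, not the author's objects), the two evaluation modes can be compared directly:

    from stable_baselines3.common.evaluation import evaluate_policy

    # Greedy evaluation (the default): the policy's most likely action is taken.
    mean_det, std_det = evaluate_policy(
        model, eval_env, n_eval_episodes=1000, deterministic=True
    )

    # Stochastic evaluation: actions are sampled from the distribution,
    # which matches what PPO does while collecting rollouts during training.
    mean_sto, std_sto = evaluate_policy(
        model, eval_env, n_eval_episodes=1000, deterministic=False
    )

    print("deterministic:", mean_det, std_det)
    print("stochastic:   ", mean_sto, std_sto)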

tacon
  • That could be the case if the author had observed the opposite, i.e. a reward from `evaluate_policy` that was too good. But it is the other way round. I would run `evaluate_policy` with `return_episode_rewards=True` and see how the rewards behave (see the sketch after these comments). – gehirndienst Feb 07 '23 at 09:15
  • @tacon I set `deterministic=False`; the reward increased a little, but is still far less than the reward seen during training. @gehirndienst I think `return_episode_rewards=True` gives the same result as wrapping with a Monitor. – Aramiis Feb 07 '23 at 11:27
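For completeness, a sketch of the `return_episode_rewards=True` suggestion from the comments, again reusing the hypothetical model and eval_env from above; instead of an aggregated mean/std it returns the total reward and length of every evaluated episode, which makes it easier to spot outliers or unexpected episode lengths:

    import numpy as np
    from stable_baselines3.common.evaluation import evaluate_policy

    # Returns two lists: per-episode total rewards and per-episode lengths.
    episode_rewards, episode_lengths = evaluate_policy(
        model, eval_env, n_eval_episodes=100, return_episode_rewards=True
    )

    print("episodes :", len(episode_rewards))
    print("mean/std :", np.mean(episode_rewards), np.std(episode_rewards))
    print("min/max  :", np.min(episode_rewards), np.max(episode_rewards))
    print("lengths  :", set(episode_lengths))   # should all be 50 for a fixed-length episode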