I wrote a custom gym environment and trained it with the PPO implementation provided by stable-baselines3. The ep_rew_mean recorded by TensorBoard is as follows:
[Figure: the ep_rew_mean curve over 100 million total steps; each episode has 50 steps]
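For context, here is roughly how I set things up. MyCustomEnv below is a simplified stand-in for my actual environment (the real one is more involved); the observation/action spaces, bounds, and placeholder reward are for illustration only:

import gym
import numpy as np
from gym import spaces
from stable_baselines3 import PPO

# Hypothetical stand-in for my custom environment: continuous observation
# and action spaces, fixed 50-step episodes, randomized initial state.
# State transitions are omitted in this sketch.
class MyCustomEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self.state = np.zeros(4, dtype=np.float32)
        self.step_count = 0

    def reset(self):
        # Initial state drawn uniformly from the area (bounds are illustrative).
        self.step_count = 0
        self.state = np.random.uniform(-1.0, 1.0, size=4).astype(np.float32)
        return self.state

    def step(self, action):
        self.step_count += 1
        reward = -float(np.linalg.norm(self.state))  # placeholder reward
        done = self.step_count >= 50                 # each episode has 50 steps
        return self.state, reward, done, {}

env = MyCustomEnv()
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./ppo_tb/")
model.learn(total_timesteps=100_000_000)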
As shown in the figure, the reward converges to around 15.5 after training. However, when I run evaluate_policy() on the trained model, the reward is much smaller than the ep_rew_mean value. The first value is the mean reward, the second is the standard deviation of the reward:
4.349947246664763 1.1806464511030819
I call evaluate_policy() as follows:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)
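Spelled out with the stand-in environment from above, the evaluation step looks like this (note that evaluate_policy() uses deterministic=True by default, while PPO samples actions stochastically during training rollouts):

from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Evaluate on a fresh, Monitor-wrapped copy of the same environment so that
# episode rewards are recorded the same way as during training.
eval_env = Monitor(MyCustomEnv())
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10000)
print(mean_reward, std_reward)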
According to my understanding, the initial state is randomly distributed over an area when reset() is called, so there should not be an overfitting problem.
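As a quick sanity check on that claim (continuing with the stand-in environment from above), sampling reset() many times shows the initial states covering the whole area:

# Sample many initial states and check that they cover the area.
env = MyCustomEnv()
initial_states = np.array([env.reset() for _ in range(1000)])
print(initial_states.min(axis=0), initial_states.max(axis=0))
print(initial_states.mean(axis=0))  # should be close to 0 for uniform(-1, 1)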
I have also tried different learning rates and other hyperparameters, but the problem remains.
I have checked my environment, and I believe it contains no errors.
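By way of illustration, a minimal sketch of such a check using stable-baselines3's built-in environment checker (applied here to the stand-in environment from above):

from stable_baselines3.common.env_checker import check_env

# Validates the observation/action spaces and the reset()/step() return
# types against the gym interface, printing warnings for common mistakes.
check_env(MyCustomEnv(), warn=True)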
I have searched the internet, read the stable-baselines3 documentation, and gone through the GitHub issues, but did not find a solution.