Recently, I decided to apply the reinforcement learning and deep Q-learning I've been studying to OpenAI's LunarLander environment.
My algorithm is plain deep Q-learning with experience replay. I want to be able to save the model/agent, then later load it on its own and have it interact with the environment without any further fitting/training of its weights. During training I saved a few models with
q_network.save(directory+"lunar_model_score{}.h5".format(accum_reward))
at the end of the episodes with the highest consecutive scores and a low epsilon value (so that the model is doing more predicting than exploring).
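For context, the saving step at the end of training looks roughly like this. This is only a simplified, self-contained sketch: the layer sizes, score threshold, and epsilon check below are illustrative placeholders, not my actual training code; the point is just the .h5 save call.

import os
import tensorflow as tf

# Illustrative sketch only: placeholder network and thresholds,
# showing where the .h5 save call happens at the end of a good,
# low-epsilon episode.
directory = "models/"
os.makedirs(directory, exist_ok=True)

q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),  # 8 state dims
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="linear"),                   # 4 discrete actions
])
q_network.compile(optimizer="adam", loss="mse")

accum_reward = 215.65  # example episode return at the time of saving
epsilon = 0.01         # low epsilon, so the agent is mostly acting greedily

if accum_reward >= 200 and epsilon <= 0.05:  # illustrative thresholds
    q_network.save(directory + "lunar_model_score{}.h5".format(accum_reward))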
However, when I load the model elsewhere and run it in the environment without any training, it performs very poorly, as if it had never been trained. Here is my testing code:
import gym
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

env = gym.make('LunarLander-v2')
action_space = env.action_space.n
state_space = env.observation_space.shape[0]

lunar_agent = tf.keras.models.load_model('C:/Users/haora/gymEnv/LunarLand/models/lunar_model_score215.65755254109038.h5')

file_name = 'lunarLand_test_data.txt'
datafile = open(file_name,"w+")
episodes = 10

lunar_agent.summary()
#print(lunar_agent.get_weights())

for e in range(episodes):
    state = env.reset()
    accum_reward = 0
    while True:
        env.render()
        state = np.reshape(state,[1,state_space])
        prediction = lunar_agent.predict(state)
        action = np.argmax(prediction[0])
        next_state, reward, done, _ = env.step(action)
        accum_reward += reward
        if done:
            break
    print("episode:{}/{} | score:{}".format(e,episodes,accum_reward))
    datafile.write(str(e)+','+str(accum_reward)+'\n')

env.close()
datafile.close()
I've verified that the weight values and architecture saved from the training script are the same as the weights I get when I call print(lunar_agent.get_weights()) after loading. So why is there such a big discrepancy between the model during training and the model when it only interacts with the environment, and how can I fix it so that I can load models from different points in training and have the agent perform accordingly when it only interacts with the environment?
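For what it's worth, instead of comparing the printed weights by eye, the check can also be done programmatically along these lines. This is only a sketch: it assumes the in-memory q_network from the training script and the saved .h5 file are both available in the same session.

import numpy as np
import tensorflow as tf

# Sketch: compare the training-time weights with the reloaded weights,
# tensor by tensor. Assumes q_network (the model from training) is defined.
reloaded = tf.keras.models.load_model(
    'C:/Users/haora/gymEnv/LunarLand/models/lunar_model_score215.65755254109038.h5')
for w_train, w_loaded in zip(q_network.get_weights(), reloaded.get_weights()):
    assert np.allclose(w_train, w_loaded), "weight mismatch after reloading"
print("all weight tensors match")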