Recently, I decided to apply the reinforcement learning and deep Q-learning I've been studying to OpenAI's LunarLander environment.
My algorithm is plain deep Q-learning with experience replay. I want to be able to save the model/agent, then later load it on its own and have it interact with the environment without any further fitting/training of its weights. During training I saved a few models with
q_network.save(directory+"lunar_model_score{}.h5".format(accum_reward))
at the end of the episodes with the highest consecutive scores and a low epsilon value (so that the model is doing more predicting than exploring).
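For context, the saving step at the end of training looks roughly like this. This is only a simplified, self-contained sketch: the layer sizes, score threshold, and epsilon check below are illustrative placeholders, not my actual training code; the point is just the .h5 save call.

import os
import tensorflow as tf

# Illustrative sketch only: placeholder network and thresholds,
# showing where the .h5 save call happens at the end of a good,
# low-epsilon episode.
directory = "models/"
os.makedirs(directory, exist_ok=True)

q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),  # 8 state dims
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="linear"),                   # 4 discrete actions
])
q_network.compile(optimizer="adam", loss="mse")

accum_reward = 215.65  # example episode return at the time of saving
epsilon = 0.01         # low epsilon, so the agent is mostly acting greedily

if accum_reward >= 200 and epsilon <= 0.05:  # illustrative thresholds
    q_network.save(directory + "lunar_model_score{}.h5".format(accum_reward))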
However, when I load the model elsewhere and run it in the environment without any training, it performs very poorly, as if it had never been trained. Here is my testing code:
import gym
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

env = gym.make('LunarLander-v2')
action_space = env.action_space.n
state_space = env.observation_space.shape[0]

lunar_agent = tf.keras.models.load_model('C:/Users/haora/gymEnv/LunarLand/models/lunar_model_score215.65755254109038.h5')

file_name = 'lunarLand_test_data.txt'
datafile = open(file_name,"w+")
episodes = 10

lunar_agent.summary()
#print(lunar_agent.get_weights())

for e in range(episodes):
    state = env.reset()
    accum_reward = 0
    while True:
        env.render()
        state = np.reshape(state,[1,state_space])
        prediction = lunar_agent.predict(state)
        action = np.argmax(prediction[0])
        next_state, reward, done, _ = env.step(action)
        accum_reward += reward
        if done:
            break
    print("episode:{}/{} | score:{}".format(e,episodes,accum_reward))
    datafile.write(str(e)+','+str(accum_reward)+'\n')

env.close()
datafile.close()
I've verified that the weight values and architecture saved from the training script are the same as the weights I get when I call print(lunar_agent.get_weights()) after loading. So why is there such a big discrepancy between the model during training and the model when it only interacts with the environment, and how can I fix it so that I can load models from different points in training and have the agent perform accordingly when it only interacts with the environment?
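For what it's worth, instead of comparing the printed weights by eye, the check can also be done programmatically along these lines. This is only a sketch: it assumes the in-memory q_network from the training script and the saved .h5 file are both available in the same session.

import numpy as np
import tensorflow as tf

# Sketch: compare the training-time weights with the reloaded weights,
# tensor by tensor. Assumes q_network (the model from training) is defined.
reloaded = tf.keras.models.load_model(
    'C:/Users/haora/gymEnv/LunarLand/models/lunar_model_score215.65755254109038.h5')
for w_train, w_loaded in zip(q_network.get_weights(), reloaded.get_weights()):
    assert np.allclose(w_train, w_loaded), "weight mismatch after reloading"
print("all weight tensors match")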