I am trying to implement the deep Q-learning approach DeepMind used to train an AI to play Atari games. One of the features they use, which is mentioned in multiple tutorials, is to keep two versions of your neural network: one that you update as you cycle through mini-batch training data (call this Q), and one that you call while doing this to help construct the training data (call this Q'). Then, periodically (say every 10k data points), the weights in Q' get set to the current values of Q.
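To illustrate the copy step in isolation, here is a minimal, self-contained sketch using tf.Variable.assign, where q_w and target_w are hypothetical stand-ins for one layer's weights in Q and Q':

import tensorflow as tf

q_w = tf.Variable(tf.truncated_normal([8, 8]))  # stand-in: a weight in Q
target_w = tf.Variable(tf.zeros([8, 8]))        # stand-in: its copy in Q'
sync_op = target_w.assign(q_w)                  # running this sets Q' := Q

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    s.run(sync_op)  # would be run every ~10k training steps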
My question is: what is the best way to do this in TensorFlow for a full network? That is, how do I store two identical-architecture networks at the same time, and periodically update one network's weights from the other's? My current net is shown below; it currently just uses the default graph and an InteractiveSession.
import tensorflow as tf

# The helper functions used below are not shown in my snippet; these are the
# standard TF tutorial versions, with conv2d adapted to take explicit strides
# (the exact init constants are just what I happen to use).
def weight_variable(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    return tf.Variable(tf.constant(0.1, shape=shape))

def conv2d(x, W, stride_h, stride_w):
    return tf.nn.conv2d(x, W, strides=[1, stride_h, stride_w, 1], padding='SAME')

sess = tf.InteractiveSession()
# Inputs: a batch of m stacked frames (m must be 4 to match the first conv
# filter below), and the target Q-values per action.
x = tf.placeholder(tf.float32, shape=[None, height, width, m])
y_ = tf.placeholder(tf.float32, shape=[None, env.action_space.n])
# Conv stack from the DQN paper: 8x8 stride 4, 4x4 stride 2, 3x3 stride 1,
# with 32/64/64 filters.
W_conv1 = weight_variable([8, 8, 4, 32])
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x, W_conv1, 4, 4) + b_conv1)

W_conv2 = weight_variable([4, 4, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 2, 2) + b_conv2)

W_conv3 = weight_variable([3, 3, 64, 64])
b_conv3 = bias_variable([64])
h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1, 1) + b_conv3)
# Flatten conv output to feed the dense layers
# (14*10*64 has to match h_conv3's shape for my height/width).
flat_input_size = 14*10*64
h_conv3_reshape = tf.reshape(h_conv3, [-1, flat_input_size])
# Dense layers
W_fc1 = weight_variable([flat_input_size, 512])
b_fc1 = bias_variable([512])
h_fc1 = tf.nn.relu(tf.matmul(h_conv3_reshape, W_fc1) + b_fc1)
W_fc2 = weight_variable([512, env.action_space.n])
b_fc2 = bias_variable([env.action_space.n])
y_conv = tf.matmul(h_fc1, W_fc2) + b_fc2
# Mean squared error between target and predicted Q-values
# (an error term, not an accuracy).
squared_error = tf.squared_difference(y_, y_conv)
loss = tf.reduce_mean(squared_error)
# train_step applies one Adam update to all trainable variables.
train_step = tf.train.AdamOptimizer(0.0001).minimize(loss)
tf.global_variables_initializer().run()
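The approach I have been considering (I am not sure it is the idiomatic one, hence the question) is to wrap the layer construction above in a function, build it twice under different variable scopes, and create assign ops pairing up the two variable lists. A rough, runnable sketch, where build_q_network stands in for the full stack above and reuses the helpers defined earlier:

def build_q_network(x):
    # In practice this would contain the full conv/dense stack above;
    # shortened to one conv layer so the sketch runs as-is.
    W = weight_variable([8, 8, 4, 32])
    b = bias_variable([32])
    return tf.nn.relu(conv2d(x, W, 4, 4) + b)

with tf.variable_scope('q'):
    q_out = build_q_network(x)
with tf.variable_scope('q_target'):
    q_target_out = build_q_network(x)

# The trailing '/' matters: a plain 'q' filter would also match 'q_target'.
q_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='q/')
target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='q_target/')

# Both lists come back in creation order, so zip pairs variables layer by layer.
sync_ops = tf.group(*[t.assign(s) for s, t in zip(q_vars, target_vars)])

sess.run(tf.global_variables_initializer())  # covers the newly created variables
sess.run(sync_ops)  # run this every ~10k steps to refresh Q'

The part I am least sure about is whether relying on creation order for the pairing is safe, or whether there is a cleaner built-in way to do this.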