I'm sure this has been answered, but I couldn't find anything that addressed my specific question.
I want to play with some reinforcement learning algorithms in a toy game. Given a particular game state, I want the network to predict action A. Given action A, I would then update the game state and call the network again on this new game state. This would repeat until the game ends.
I can't figure out, however, how to backpropagate the gradients through all of the network's predictions. Here's my (very rough) TensorFlow pseudocode for how I imagine this would go:
    for game_step in range(max_game_steps):
        # run the network to pick an action for the current state
        action = sess.run(predict_action, feed_dict={game_state: current_state})
        # apply the action outside the graph, in plain Python
        current_state = update_state(current_state, action)

    # somehow push the loss back through every prediction made above
    backpropagate_error_through_all_actions()
In this formulation, each gradient only has context for its own single action: `sess.run` returns a plain NumPy array, so the connection to the graph is severed at every step and I can't send gradients back through all of the states. I can't just run the network 30 times in a row inside the graph, though, because I need to perform the state updates in Python between calls... What am I missing?
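To make the kind of loop I mean concrete without any TensorFlow at all, here's a self-contained pure-Python/NumPy sketch of what I'm imagining: record every (state, action) pair during the episode, then do a single update at the end from a hand-rolled gradient for a tiny softmax policy (REINFORCE-style, so nothing has to differentiate through `update_state`). The toy game, the policy, and all the names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy setup: 1-D state, 2 actions, one logit weight per action.
w = np.zeros(2)

def policy_probs(state, w):
    """Softmax over per-action logits w[a] * state."""
    logits = w * state
    e = np.exp(logits - logits.max())
    return e / e.sum()

def update_state(state, action):
    """Toy game: action 1 increments the state, action 0 decrements it."""
    return state + (1.0 if action == 1 else -1.0)

def run_episode(w, max_game_steps=10):
    """Play one game, recording (state, action, probs) at every step."""
    state, trajectory = 1.0, []
    for game_step in range(max_game_steps):
        probs = policy_probs(state, w)
        action = rng.choice(2, p=probs)
        trajectory.append((state, action, probs))
        state = update_state(state, action)
    reward = 1.0 if state > 0 else 0.0  # terminal reward only
    return trajectory, reward

def episode_update(w, lr=0.1):
    """One update per episode: sum grad log pi over all steps, scale by reward."""
    trajectory, reward = run_episode(w)
    grad = np.zeros_like(w)
    for state, action, probs in trajectory:
        # For a softmax with logits w * state:
        # d log pi(a|s) / dw = (onehot(a) - probs) * state
        grad += (np.eye(2)[action] - probs) * state
    return w + lr * reward * grad

for _ in range(200):
    w = episode_update(w)
```

Is something along these lines, where the per-step context is accumulated in Python and applied in one update at the end, the right shape for this, or is there a proper way to do it inside the graph?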
Do I need to model the entire toy game inside the TensorFlow graph? That seems unreasonable.
Thanks!