I'm implementing a Q-network as described in "Human-level control through deep reinforcement learning" (Mnih et al., 2015) in TensorFlow.
To approximate the Q-function they use a neural network. The Q-function maps a state and an action to a scalar value, known as the Q-value; i.e., it's a function of the form Q(s,a) = qvalue.
But instead of taking both state and action as input, the network takes only the state as input and outputs a vector with one element per legal action, in a fixed order. Thus Q(s,a) becomes Q'(s) = array([val_a1, val_a2, val_a3, ...]), where val_a1 is Q(s, a1).
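To make this concrete, here is a tiny sketch of what such an output looks like in TensorFlow (a single linear layer stands in for the paper's convolutional network; the sizes and names are arbitrary, not taken from the paper):

```python
import tensorflow as tf

n_actions = 4        # one output unit per legal action (assumed count)
state_dim = 84 * 84  # flattened state, just for illustration

state = tf.placeholder(tf.float32, [None, state_dim], name="state")

# Last layer of the Q-network: one Q-value per action
w = tf.Variable(tf.truncated_normal([state_dim, n_actions], stddev=0.01))
b = tf.Variable(tf.zeros([n_actions]))
q_out = tf.matmul(state, w) + b  # shape [batch, n_actions]; q_out[:, i] corresponds to Q(s, a_i)
```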
This raises the question of how to modify the loss function. The loss is an L2 loss computed on the difference between a target (y) and Q(s,a).
My idea is to create a new TF operation that uses a binary mask indicating which action I want to train on, and multiply it with the output of the network. This effectively produces a vector like [0, 0, val_a3, 0, ...] if the action in question is a3.
I would then feed the result of this new operation into the loss operation, which TF then minimizes.
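Roughly what I have in mind, written out as TF code (again with a stand-in linear layer instead of the real network, and with placeholder names of my own choosing):

```python
import tensorflow as tf

n_actions = 4
state_dim = 84 * 84

state = tf.placeholder(tf.float32, [None, state_dim], name="state")
action_mask = tf.placeholder(tf.float32, [None, n_actions], name="action_mask")  # one-hot per sample
y = tf.placeholder(tf.float32, [None], name="target")  # targets, computed outside the graph

# Stand-in for the real Q-network (see the sketch above)
w = tf.Variable(tf.truncated_normal([state_dim, n_actions], stddev=0.01))
b = tf.Variable(tf.zeros([n_actions]))
q_out = tf.matmul(state, w) + b  # shape [batch, n_actions]

# Zero out every column except the chosen action, then collapse to one value per sample
q_selected = tf.reduce_sum(q_out * action_mask, 1)  # shape [batch]

# L2 loss between the target and Q(s, a) for the action actually taken
loss = tf.reduce_mean(tf.square(y - q_selected))

# Optimizer choice and learning rate are only illustrative here
train_op = tf.train.RMSPropOptimizer(0.00025).minimize(loss)
```

At training time I would feed a one-hot mask built from the chosen action indices, e.g. action_mask: np.eye(n_actions)[a_batch] with NumPy.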
Questions:
Is this a sound idea? Or is there a better way of solving this?
How can this be solved with TensorFlow?
There is an SO thread on something similar (Adjust Single Value within Tensor -- TensorFlow), but I would like to choose the column value with the help of a tf.placeholder that I can feed to the network at runtime. It doesn't seem to work when I just replace the static lists in that example with placeholders.
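To show the behaviour I'm after: pick one column per row, where the column index is fed through a placeholder at runtime. Here is a small self-contained sketch of the intended result (building the mask in-graph with tf.one_hot is just one possible way to do this; q_out stands in for the network output):

```python
import tensorflow as tf

n_actions = 4

q_out = tf.placeholder(tf.float32, [None, n_actions])  # stand-in for the network output
actions = tf.placeholder(tf.int32, [None])             # chosen action index per sample

# Build the one-hot mask inside the graph from the indices fed at runtime
mask = tf.one_hot(actions, n_actions, dtype=tf.float32)  # shape [batch, n_actions]
q_selected = tf.reduce_sum(q_out * mask, 1)              # shape [batch]

with tf.Session() as sess:
    print(sess.run(q_selected, feed_dict={
        q_out:   [[1.0, 2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0, 8.0]],
        actions: [2, 0]}))  # expected output: [3. 5.]
```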