
I'm implementing the Q-network described in "Human-level control through deep reinforcement learning" (Mnih et al., 2015) in TensorFlow.

To approximate the Q-function they use a neural network. The Q-function maps a state and an action to a scalar value, known as the Q-value. That is, it's a function of the form Q(s, a) = qvalue.

But instead of taking both the state and the action as input, their network takes only the state as input and outputs a vector with one element per legal action, in a fixed order. Thus Q(s,a) becomes Q'(s) = array([val_a1, val_a2, val_a3, ...]), where val_a1 is Q(s, a1).
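For concreteness, here is a minimal sketch of such a network in graph-mode TensorFlow (the single fully connected hidden layer, the layer sizes, and the variable names are my own simplifications for illustration, not the convolutional architecture from the paper):

import tensorflow as tf

num_actions = 4      # assumed number of legal actions
state_dim = 84 * 84  # assumed (flattened) state size

# The network takes only the state as input...
state = tf.placeholder(tf.float32, [None, state_dim])

# ...and outputs one Q-value per legal action, in a fixed order.
W1 = tf.Variable(tf.random_normal([state_dim, 256]))
hidden = tf.nn.relu(tf.matmul(state, W1))
W2 = tf.Variable(tf.random_normal([256, num_actions]))
q_values = tf.matmul(hidden, W2)  # shape [batch_size, num_actions], i.e. Q'(s)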

This raises the question of how to modify the loss function. The loss function is an L2 loss computed on the difference between a target (y) and Q(s, a).
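Concretely, for a transition (s, a, r, s') the per-sample loss in the paper is (roughly)

loss = (y - Q(s, a))^2,   with   y = r + gamma * max_a' Q_target(s', a'),

where Q_target is the periodically updated target network.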

My idea is to create a new TF operation that uses a binary mask indicating which action I want to train on and multiplies it with the output of the network, effectively producing a vector like [0, 0, val_a3, 0, ...] if the action in question is a3.

I would then feed the result of this new operation into the loss operation, which TensorFlow then minimizes.
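A rough sketch of that idea, continuing from the q_values output above (I'm using the old tf.mul name, which later became tf.multiply; the placeholder names are my own):

action_mask = tf.placeholder(tf.float32, [None, num_actions])  # one-hot row per transition
target = tf.placeholder(tf.float32, [None])                    # the targets y

# Zero out every entry except the chosen action, e.g. [0, 0, val_a3, 0, ...].
masked_q = tf.mul(q_values, action_mask)
# Collapse the masked vector back to the scalar Q(s, a).
q_sa = tf.reduce_sum(masked_q, reduction_indices=[1])

# L2 loss between the target and the selected Q-value, then minimize it.
loss = tf.reduce_mean(tf.square(target - q_sa))
train_op = tf.train.RMSPropOptimizer(0.00025).minimize(loss)  # learning rate is illustrative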

Questions:

  1. Is this a sound idea? Or is there a better way of solving this?

  2. How can this be solved with TensorFlow?

    There is an SO thread on something similar (Adjust Single Value within Tensor -- TensorFlow), but I would like to choose the column with the help of a tf.placeholder that I can feed to the network at runtime. It doesn't seem to work when I just replace the static lists in that example with placeholders.


1 Answer


There are a few implementations of deep Q-learning in TensorFlow out there that might be useful references to check out:

https://github.com/asrivat1/DeepLearningVideoGames

https://github.com/nivwusquorum/tensorflow-deepq

https://github.com/mrkulk/deepQN_tensorflow

I'm not sure what the best idea is without digging more deeply, but you can definitely apply a mask in a few different ways.

If you already have your binary mask set up as a boolean vector, e.g. [False, False, True, False], then you can do:

import tensorflow as tf

val_array = tf.constant([1.0, 2.0, 3.0, 4.0])  # e.g. the network's Q-value output
binary_mask = tf.constant([False, False, True, False])
result = tf.select(binary_mask, val_array, tf.zeros_like(val_array))

This selects the entries from val_array wherever binary_mask is True, and zeros otherwise.

If your mask is not boolean but already has the same numeric type as val_array (e.g., 0.0s and 1.0s), then you can simply do tf.mul(mask, val_array).
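For example, the mask could be a placeholder that you fill at runtime from the actions that were actually taken (continuing with the q_values and num_actions names from the question; the other names are just illustrative):

import numpy as np

mask = tf.placeholder(tf.float32, [None, num_actions])  # one row per transition
selected = tf.mul(mask, q_values)                        # tf.multiply in TF 1.0+

# At runtime, turn the chosen action indices into one-hot rows and feed them in.
actions = np.array([2, 0])                 # e.g. a3 was taken, then a1
mask_value = np.eye(num_actions)[actions]  # [[0, 0, 1, 0], [1, 0, 0, 0]]
# sess.run(selected, feed_dict={state: ..., mask: mask_value})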

vrv
  • So what the links you provided do is have a placeholder for the action, like `action_mask = tf.placeholder("float", [None, num_actions])`. Then they do `masked_action = tf.mul(network_output, action_mask)`, followed by `tf.reduce_sum(masked_action, reduction_indices=[1,])`. That seems like a good idea, at least from what I can tell. – Skeppet Jan 22 '16 at 06:41