I plan to run a very large recurrent network (e.g. 2048x5). Is it possible to place each layer on a different GPU in TensorFlow? How should I implement the model to achieve the best efficiency? I understand there is overhead for inter-GPU or GPU-CPU-GPU communication.
- [Here](https://www.tensorflow.org/versions/r0.7/how_tos/using_gpu/index.html#using_multiple_gpus) are the instructions, and [here](https://www.tensorflow.org/versions/r0.7/tutorials/deep_cnn/index.html) is an example. Data parallelism is much easier than functional (model) parallelism; a minimal tower-style sketch appears after these comments. – fluency03 Mar 30 '16 at 16:12
- I understand the usage of `with tf.device()`. However, after I define layers on different GPUs, I find the gradients are still stored on the first GPU. Can you give a concrete example of splitting gradient computation across different GPUs? – read Read Mar 30 '16 at 20:41
- You might also try passing `colocate_gradients_with_ops=True` to the `optimizer.minimize()` method when building your model. – mrry Mar 30 '16 at 20:43
- @mrry It works! Now I am seeing the computation is evenly distributed. – read Read Mar 30 '16 at 20:59
- What about the case where you are applying `clip_by_norm` -- how do you ensure that each GPU clips its respective gradients so you are not wasting time transferring tensors back and forth? – LeavesBreathe Oct 28 '16 at 19:04
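To illustrate the data-parallel ("tower") approach contrasted with model parallelism in the first comment, here is a minimal sketch in the spirit of the linked CIFAR-10 tutorial. The two-GPU count, the `tower_loss` model, and the tensor shapes are hypothetical placeholders, not part of the question or comments above:

```python
import tensorflow as tf

def tower_loss(x, y):
    # Hypothetical single-layer model with a softmax loss; replace with the real network.
    w = tf.get_variable("w", [1024, 10])
    b = tf.get_variable("b", [10])
    logits = tf.matmul(x, w) + b
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

optimizer = tf.train.GradientDescentOptimizer(0.01)

# Data parallelism: every GPU holds a full copy of the model (shared variables)
# and processes its own shard of the batch.
x_shards = [tf.placeholder(tf.float32, [None, 1024]) for _ in range(2)]
y_shards = [tf.placeholder(tf.int64, [None]) for _ in range(2)]

tower_grads = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        with tf.variable_scope("model", reuse=(i > 0)):
            loss = tower_loss(x_shards[i], y_shards[i])
            tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply one combined update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(avg_grads)
```

Each tower computes gradients for its own shard locally; only the averaged gradients cross devices.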
1 Answer
Splitting a large model across multiple GPUs is certainly possible in TensorFlow, but doing it optimally is a hard research problem. In general, you will need to do the following:
1. Wrap large contiguous regions of your code in a `with tf.device(...):` block, naming the different GPUs:

    ```python
    with tf.device("/gpu:0"):
      # Define first layer.

    with tf.device("/gpu:1"):
      # Define second layer.

    # Define other layers, etc.
    ```
2. When building your optimizer, pass the optional argument `colocate_gradients_with_ops=True` to the `optimizer.minimize()` method:

    ```python
    loss = ...
    optimizer = tf.train.AdagradOptimizer(0.01)
    train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)
    ```
3. (Optionally.) You may need to enable "soft placement" in the `tf.ConfigProto` when you create your `tf.Session`, if any of the operations in your model cannot run on GPU:

    ```python
    config = tf.ConfigProto(allow_soft_placement=True)
    sess = tf.Session(config=config)
    ```
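Putting these steps together, a minimal end-to-end sketch might look like the following; the two-layer split, the layer sizes, and the use of feed placeholders are illustrative assumptions rather than part of the answer above:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])
y = tf.placeholder(tf.float32, [None, 10])

# Step 1: place each large region of the graph on its own GPU.
with tf.device("/gpu:0"):
    w1 = tf.Variable(tf.random_normal([1024, 2048]))
    h = tf.nn.relu(tf.matmul(x, w1))

with tf.device("/gpu:1"):
    w2 = tf.Variable(tf.random_normal([2048, 10]))
    logits = tf.matmul(h, w2)
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Step 2: colocate each gradient op with the forward op it differentiates,
# so backprop for each layer also runs on that layer's GPU.
optimizer = tf.train.AdagradOptimizer(0.01)
train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)

# Step 3: let TensorFlow fall back to CPU for any op without a GPU kernel.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(train_op, feed_dict={x: ..., y: ...})
```

Note that the activations `h` must cross from `/gpu:0` to `/gpu:1` once per step, which is the inter-GPU communication cost the question asks about.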

– mrry
- I run my network on 2 GPUs, and both the forward and backward computation are distributed across them. However, after a few hours of training, I find that GPU utilization is really low: the queue occupancy (# batches in the queue) is 0, meaning the queue is not being filled quickly enough. I am using a thread to pump data into the queue. Should I explicitly define the queue and the enqueue and dequeue operations on the CPU? – read Read Mar 31 '16 at 21:32
- Yes, we've found that pinning the input pipeline to the CPU improves the overall performance of our model training (otherwise you get interference from the parts of the input pipeline that can run on the GPU). – mrry Mar 31 '16 at 23:21
- "Pinning the input pipeline to CPU" -- could you explain that in a bit more detail, please? – herve Oct 10 '16 at 18:27
- You would use a `with tf.device("/cpu:0"):` block to wrap the construction of the ops in the input pipeline (see the sketch after these comments). – mrry Oct 10 '16 at 19:22
- What's the difference between using this 'depth' approach and splitting the batch into smaller ones? I have the **feeling** this approach is more memory-efficient, as I don't have to replicate the same network on each GPU. If so, why did keras/tensorflow implement the `towers` approach? – ldavid Nov 18 '17 at 23:46
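To make the CPU-pinning suggestion above concrete, here is a minimal sketch of a queue-based input pipeline wrapped in a `/cpu:0` device block; the TFRecord filename, the feature shape, and the batching parameters are hypothetical:

```python
import tensorflow as tf

# Keep the queue, reader, and preprocessing ops on the CPU so they never
# compete with the model ops placed on the GPUs.
with tf.device("/cpu:0"):
    filename_queue = tf.train.string_input_producer(["data_0.tfrecords"])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={"x": tf.FixedLenFeature([1024], tf.float32)})
    batch = tf.train.shuffle_batch(
        [features["x"]], batch_size=32, capacity=1000, min_after_dequeue=500)

# The model itself is then defined on the GPUs as in the answer above,
# consuming `batch` as its input.
```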