I plan to run a very large recurrent network (e.g. 2048x5). Is it possible to place each layer on a different GPU in TensorFlow? How should I implement the model to achieve the best efficiency? I understand there is overhead for inter-GPU or GPU-CPU-GPU communication.
- [Here](https://www.tensorflow.org/versions/r0.7/how_tos/using_gpu/index.html#using_multiple_gpus) are the instructions, and [here](https://www.tensorflow.org/versions/r0.7/tutorials/deep_cnn/index.html) is an example. Data parallelism is much easier than functional (model) parallelism; a minimal tower-style sketch appears after these comments. – fluency03 Mar 30 '16 at 16:12
- I understand the usage of `with tf.device()`. However, after I define layers on different GPUs, I find the gradients are still stored on the first GPU. Can you give a concrete example of splitting gradient computation across different GPUs? – read Read Mar 30 '16 at 20:41
- You might also try passing `colocate_gradients_with_ops=True` to the `optimizer.minimize()` method when building your model. – mrry Mar 30 '16 at 20:43
- @mrry It works! Now I am seeing the computation is evenly distributed. – read Read Mar 30 '16 at 20:59
- What about the case where you are applying `clip_by_norm` -- how do you ensure that each GPU clips its respective gradients so you are not wasting time transferring tensors back and forth? – LeavesBreathe Oct 28 '16 at 19:04
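To illustrate the data-parallel ("tower") approach contrasted with model parallelism in the first comment, here is a minimal sketch in the spirit of the linked CIFAR-10 tutorial. The two-GPU count, the `tower_loss` model, and the tensor shapes are hypothetical placeholders, not part of the question or comments above:

```python
import tensorflow as tf

def tower_loss(x, y):
    # Hypothetical single-layer model with a softmax loss; replace with the real network.
    w = tf.get_variable("w", [1024, 10])
    b = tf.get_variable("b", [10])
    logits = tf.matmul(x, w) + b
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

optimizer = tf.train.GradientDescentOptimizer(0.01)

# Data parallelism: every GPU holds a full copy of the model (shared variables)
# and processes its own shard of the batch.
x_shards = [tf.placeholder(tf.float32, [None, 1024]) for _ in range(2)]
y_shards = [tf.placeholder(tf.int64, [None]) for _ in range(2)]

tower_grads = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        with tf.variable_scope("model", reuse=(i > 0)):
            loss = tower_loss(x_shards[i], y_shards[i])
            tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply one combined update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
train_op = optimizer.apply_gradients(avg_grads)
```

Each tower computes gradients for its own shard locally; only the averaged gradients cross devices.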
1 Answer
Splitting a large model across multiple GPUs is certainly possible in TensorFlow, but doing it optimally is a hard research problem. In general, you will need to do the following:
1. Wrap large contiguous regions of your code in a `with tf.device(...):` block, naming the different GPUs:

    ```python
    with tf.device("/gpu:0"):
      # Define first layer.

    with tf.device("/gpu:1"):
      # Define second layer.

    # Define other layers, etc.
    ```
2. When building your optimizer, pass the optional argument `colocate_gradients_with_ops=True` to the `optimizer.minimize()` method:

    ```python
    loss = ...
    optimizer = tf.train.AdagradOptimizer(0.01)
    train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)
    ```
3. (Optionally.) You may need to enable "soft placement" in the `tf.ConfigProto` when you create your `tf.Session`, if any of the operations in your model cannot run on GPU:

    ```python
    config = tf.ConfigProto(allow_soft_placement=True)
    sess = tf.Session(config=config)
    ```
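Putting these steps together, a minimal end-to-end sketch might look like the following; the two-layer split, the layer sizes, and the use of feed placeholders are illustrative assumptions rather than part of the answer above:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])
y = tf.placeholder(tf.float32, [None, 10])

# Step 1: place each large region of the graph on its own GPU.
with tf.device("/gpu:0"):
    w1 = tf.Variable(tf.random_normal([1024, 2048]))
    h = tf.nn.relu(tf.matmul(x, w1))

with tf.device("/gpu:1"):
    w2 = tf.Variable(tf.random_normal([2048, 10]))
    logits = tf.matmul(h, w2)
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Step 2: colocate each gradient op with the forward op it differentiates,
# so backprop for each layer also runs on that layer's GPU.
optimizer = tf.train.AdagradOptimizer(0.01)
train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)

# Step 3: let TensorFlow fall back to CPU for any op without a GPU kernel.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(train_op, feed_dict={x: ..., y: ...})
```

Note that the activations `h` must cross from `/gpu:0` to `/gpu:1` once per step, which is the inter-GPU communication cost the question asks about.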

– mrry
- I run my network on 2 GPUs, and both the forward and backward computation are distributed across them. However, after a few hours of training, I find that GPU utilization is really low: the queue occupancy (# batches in the queue) is 0, meaning the queue is not being filled quickly enough. I am using a thread to pump data into the queue. Should I explicitly define the queue and the enqueue and dequeue operations on the CPU? – read Read Mar 31 '16 at 21:32
- Yes, we've found that pinning the input pipeline to the CPU improves the overall performance of our model training (otherwise you get interference from the parts of the input pipeline that can run on the GPU). – mrry Mar 31 '16 at 23:21
- "Pinning the input pipeline to CPU" -- could you explain that in a bit more detail, please? – herve Oct 10 '16 at 18:27
- You would use a `with tf.device("/cpu:0"):` block to wrap the construction of the ops in the input pipeline (see the sketch after these comments). – mrry Oct 10 '16 at 19:22
- What's the difference between using this 'depth' approach and splitting the batch into smaller ones? I have the **feeling** this approach is more memory-efficient, as I don't have to replicate the same network on each GPU. If so, why did keras/tensorflow implement the `towers` approach? – ldavid Nov 18 '17 at 23:46
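To make the CPU-pinning suggestion above concrete, here is a minimal sketch of a queue-based input pipeline wrapped in a `/cpu:0` device block; the TFRecord filename, the feature shape, and the batching parameters are hypothetical:

```python
import tensorflow as tf

# Keep the queue, reader, and preprocessing ops on the CPU so they never
# compete with the model ops placed on the GPUs.
with tf.device("/cpu:0"):
    filename_queue = tf.train.string_input_producer(["data_0.tfrecords"])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={"x": tf.FixedLenFeature([1024], tf.float32)})
    batch = tf.train.shuffle_batch(
        [features["x"]], batch_size=32, capacity=1000, min_after_dequeue=500)

# The model itself is then defined on the GPUs as in the answer above,
# consuming `batch` as its input.
```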