I'm currently working on a system with two GPUs, each with 12 GB of memory. I want to implement model parallelism across the two GPUs to train large models. I have looked all over the internet, SO, the TensorFlow documentation, etc.; I was able to find explanations of model parallelism and its results, but nowhere did I find a small tutorial or code snippet on how to implement it using TensorFlow. I mean, we have to exchange activations after every layer, right? So how do we do that? Is there a specific or cleaner way of implementing model parallelism in TensorFlow? It would be very helpful if you could suggest a place where I can learn to implement it, or simple code such as MNIST training on multiple GPUs using model parallelism.

Note: I have done data parallelism as in the CIFAR-10 multi-GPU tutorial, but I haven't found any implementation of model parallelism.

krish567

1 Answer


Here's an example. The model has some parts on GPU 0, some parts on GPU 1, and some parts on the CPU, so this is 3-way model parallelism.

import tensorflow as tf

# One part of the graph (a variable and its op) lives on GPU 0
with tf.device("/gpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
# Another part lives on GPU 1
with tf.device("/gpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
# The two partial results are combined on the CPU
with tf.device("/cpu:0"):
    loss = a + b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)
Yaroslav Bulatov
  • @Bulatov - Thanks for the above implementation, but what I was looking for is this: when implementing model parallelism for a neural network, we have to share the activations after the forward pass through every layer, so my main doubt was how to pass the weight matrices from one device to the other (i.e., between GPUs). – krish567 Feb 09 '17 at 07:09
  • That's all done automatically for you. I.e., just replace the `tf.square` part in the code above with `a = create_left_part_of_network` and `b = create_right_part_of_network`, and you'll end up with a network partitioned between gpu0 and gpu1 (see the sketch after these comments). – Yaroslav Bulatov Feb 09 '17 at 15:10
  • It is working as I expected, but it is slower than running everything on one GPU. Do you know why this is happening? Here are the links to the code: [multi_gpu](https://github.com/krish567/deep_learning/blob/master/model3.py) [one gpu](https://github.com/krish567/deep_learning/blob/master/model2.py) – krish567 Feb 10 '17 at 12:42
  • You have to look at timelines and check what the bottleneck is – Yaroslav Bulatov Feb 10 '17 at 17:18
  • Thank you very much for the help; it is faster now, but the problem I was facing is that the multi-GPU program takes double the memory footprint of running the entire program on one GPU. Can you think of any reason why this might be happening? The links to the code are above; I will update them with the latest versions – krish567 Feb 11 '17 at 04:38
  • Hey, I don't know why, but my program takes double the memory when running on 2 GPUs compared to running on one GPU alone. Can you help me debug why this might be? One GPU - memory usage: 8.5 GB; two GPUs - memory usage: 17 GB – krish567 Feb 14 '17 at 14:26
  • @krish567 that's normal -- TensorFlow allocates all GPU memory by default, so if you give it 2x GPUs, it'll allocate 2x memory – Yaroslav Bulatov Feb 14 '17 at 21:07
  • Does it happen like that even if I set the `allow_growth` option to true when running the session? – krish567 Feb 18 '17 at 18:05
  • With `allow_growth` it should be better (a minimal config sketch follows after these comments), although it will still end up holding memory that it is not using, so `nvidia-smi` will give an incorrect value. See https://github.com/yaroslavvb/memory_probe_ops for a way to get the actual memory usage – Yaroslav Bulatov Feb 18 '17 at 18:08
  • @YaroslavBulatov Hi, I have the same problem. I am working on an image recognition problem. Say I can run the whole model with batch size 16 on one GPU. But when I split the model into two parts, distributed nearly equally across two GPUs, I can only increase the batch size to 18 due to GPU memory, whereas it is supposed to be close to 32. I think I am missing something about how TensorFlow uses GPU memory. Any idea? Thanks. – LI Xuhong Mar 01 '17 at 23:12
  • @Seven If all parameters and activations are decreased by 50%, then achievable peak memory should also decrease by 50%. However, TensorFlow's execution order is non-deterministic, so it may pick an execution order that is inefficient for memory. I use this utility to force a memory-efficient execution order -- https://github.com/yaroslavvb/stuff/tree/master/linearize – Yaroslav Bulatov Mar 02 '17 at 00:25
  • I was wondering: with model parallelism, does TensorFlow feed new data to a layer A on GPU 0 once it has finished calculating the previous batch for layer A, while layer B on GPU 1 is using the previous output of layer A? Or will computation halt until the data has propagated through all layers (distributed over multiple devices)? – Visionscaper Sep 01 '18 at 23:20
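
As a minimal sketch of what the "replace `tf.square` with the left/right part of the network" comment above describes, here is a two-layer MLP split across the two GPUs. This is an illustration, not code from the answer: the placeholders `x`/`y`, the layer sizes, and the variable names are assumptions. The activation `h1` is produced on GPU 0 and consumed on GPU 1; the device-to-device copy is inserted by TensorFlow automatically.

import tensorflow as tf

# Hypothetical MNIST-sized placeholders, purely for illustration
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# "Left" part of the network on GPU 0
with tf.device("/gpu:0"):
    w1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
    b1 = tf.Variable(tf.zeros([256]))
    h1 = tf.nn.relu(tf.matmul(x, w1) + b1)

# "Right" part of the network on GPU 1; the GPU0 -> GPU1 transfer
# of the activation h1 is handled by TensorFlow
with tf.device("/gpu:1"):
    w2 = tf.Variable(tf.truncated_normal([256, 10], stddev=0.1))
    b2 = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(h1, w2) + b2
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))

# Gradients flow back across the device boundary automatically as well
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)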
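
For the `allow_growth` exchange above, a minimal sketch of how that option is usually passed when creating the session (again an illustration, not code from the thread); it could replace the plain `tf.Session()` call in the answer's example.

import tensorflow as tf

# Grow GPU allocations on demand instead of reserving (almost)
# all memory on every visible GPU up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)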