
What is the command to run distributed training on multiple nodes where each node has multiple GPUs? The example at https://github.com/tensorflow/models/tree/master/inception only shows the case where each node has 1 GPU/1 worker. In my cluster, each node has 4 GPUs, which should require 4 workers.

I tried the following commands. On node 0:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=3 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=4 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=7 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

Note that there is an & at the end of each command so that they can be executed in parallel, but this fails with an out-of-GPU-memory error.
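(One thing I notice myself: in this first attempt, all four workers on worker0.example.com share port 2222, and two server processes on the same host cannot listen on the same port, so I suspect each worker process needs its own port in worker_hosts, something like the line below; ports 2223-2225 are just placeholders I picked. Each process also presumably tries to claim memory on all four GPUs by default, which may be what triggers the memory error.)

--worker_hosts='worker0.example.com:2222,worker0.example.com:2223,worker0.example.com:2224,worker0.example.com:2225,worker1.example.com:2222,worker1.example.com:2223,worker1.example.com:2224,worker1.example.com:2225'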

I also tried using only 1 worker on each node, with each worker using 4 GPUs. On node 0:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--gpus=4 \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--gpus=4 \
--task_id=1 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

But in the end each node only uses 1 GPU.
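(One way to confirm which GPUs a node is actually using is nvidia-smi, which shows per-GPU utilization and the processes attached to each GPU:)

# Refresh the per-GPU utilization/process listing every 2 seconds.
watch -n 2 nvidia-smi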

So what is the exact command I should use? Thanks.

silence_lamb
  • It sounds like you want 4 workers per node, each with a different [CUDA_VISIBLE_DEVICES](http://stackoverflow.com/a/34776814/6824418) set. – Allen Lavoie Apr 10 '17 at 17:12
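A minimal sketch of what the comment above suggests, for node 0 only. It assumes each of the four worker processes gets its own port (2223-2225 are arbitrary choices, not from the inception README) and is pinned to a single GPU via CUDA_VISIBLE_DEVICES; only the flags already shown in the question are used:

# All eight worker endpoints, one port per process on each host (assumed layout).
HOSTS='worker0.example.com:2222,worker0.example.com:2223,worker0.example.com:2224,worker0.example.com:2225,worker1.example.com:2222,worker1.example.com:2223,worker1.example.com:2224,worker1.example.com:2225'

# Node 0: four worker processes, each seeing exactly one GPU.
CUDA_VISIBLE_DEVICES=0 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=0 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

CUDA_VISIBLE_DEVICES=1 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=1 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

CUDA_VISIBLE_DEVICES=2 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=2 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

CUDA_VISIBLE_DEVICES=3 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=3 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

# Node 1 would be launched the same way, with CUDA_VISIBLE_DEVICES=0..3 and --task_id=4..7.

With this layout each process sees only one GPU, so the default behavior of grabbing memory on every visible GPU no longer makes the four processes on a node collide.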

0 Answers