
What is the command to run distributed training on multiple nodes where each node has multiple GPUs? The example at https://github.com/tensorflow/models/tree/master/inception only shows the case where each node has 1 GPU/1 worker. In my cluster, each node has 4 GPUs, which should require 4 workers.

I tried the following commands. On node 0:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=3 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=4 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=7 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

Note that there is an & at the end of each command so that they can be executed in parallel, but this fails with an out-of-GPU-memory error.
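(One thing I notice myself: in this first attempt, all four workers on worker0.example.com share port 2222, and two server processes on the same host cannot listen on the same port, so I suspect each worker process needs its own port in worker_hosts, something like the line below; ports 2223-2225 are just placeholders I picked. Each process also presumably tries to claim memory on all four GPUs by default, which may be what triggers the memory error.)

--worker_hosts='worker0.example.com:2222,worker0.example.com:2223,worker0.example.com:2224,worker0.example.com:2225,worker1.example.com:2222,worker1.example.com:2223,worker1.example.com:2224,worker1.example.com:2225'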

I also tried using only 1 worker on each node, with each worker using 4 GPUs. On node 0:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--gpus=4 \
--task_id=0 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--gpus=4 \
--task_id=1 \
--ps_hosts='ps0.example.com:2222' \
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

But in the end each node only uses 1 GPU.
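(One way to confirm which GPUs a node is actually using is nvidia-smi, which shows per-GPU utilization and the processes attached to each GPU:)

# Refresh the per-GPU utilization/process listing every 2 seconds.
watch -n 2 nvidia-smi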

So what is the exact command I should use? Thanks.

silence_lamb
  • It sounds like you want 4 workers per node, each with a different [CUDA_VISIBLE_DEVICES](http://stackoverflow.com/a/34776814/6824418) set. – Allen Lavoie Apr 10 '17 at 17:12
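A minimal sketch of what the comment above suggests, for node 0 only. It assumes each of the four worker processes gets its own port (2223-2225 are arbitrary choices, not from the inception README) and is pinned to a single GPU via CUDA_VISIBLE_DEVICES; only the flags already shown in the question are used:

# All eight worker endpoints, one port per process on each host (assumed layout).
HOSTS='worker0.example.com:2222,worker0.example.com:2223,worker0.example.com:2224,worker0.example.com:2225,worker1.example.com:2222,worker1.example.com:2223,worker1.example.com:2224,worker1.example.com:2225'

# Node 0: four worker processes, each seeing exactly one GPU.
CUDA_VISIBLE_DEVICES=0 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=0 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

CUDA_VISIBLE_DEVICES=1 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=1 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

CUDA_VISIBLE_DEVICES=2 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=2 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

CUDA_VISIBLE_DEVICES=3 bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=3 \
--ps_hosts='ps0.example.com:2222' --worker_hosts="$HOSTS" &

# Node 1 would be launched the same way, with CUDA_VISIBLE_DEVICES=0..3 and --task_id=4..7.

With this layout each process sees only one GPU, so the default behavior of grabbing memory on every visible GPU no longer makes the four processes on a node collide.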

0 Answers