What is the command to run distributed training on multiple nodes where each node has multiple GPUs? The example in https://github.com/tensorflow/models/tree/master/inception only shows the case where each node has 1 GPU and 1 worker. In my cluster, each node has 4 GPUs, which should require 4 workers per node.
I tried the following commands. On node 0:
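(For context, I start the parameter server separately with the same binary, roughly following the pattern on that page, with --worker_hosts set to whichever worker list I use below; keeping it off the GPUs with CUDA_VISIBLE_DEVICES='' is my own addition:

CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
  --job_name='ps' \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
)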
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=3 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
On node 1:
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=4 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=7 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
Note that there is an & at the end of each command so that they run in parallel, but this fails with an out-of-GPU-memory error.
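My guess is that the out-of-memory error happens because each of the 4 worker processes on a node tries to allocate memory on all 4 GPUs, and listing port 2222 four times per host also looks wrong. So one thing I considered (only a sketch of my own, not from the documentation; the extra ports 2223-2225 and the CUDA_VISIBLE_DEVICES masking are my assumptions) is to give each worker process its own GPU and its own port. On node 0:

WORKERS='worker0.example.com:2222,worker0.example.com:2223,worker0.example.com:2224,worker0.example.com:2225,worker1.example.com:2222,worker1.example.com:2223,worker1.example.com:2224,worker1.example.com:2225'
for i in 0 1 2 3; do
  # Hide all but one GPU from each TensorFlow process and give each worker its own port.
  CUDA_VISIBLE_DEVICES=$i bazel-bin/inception/imagenet_distributed_train \
    --batch_size=32 \
    --data_dir=$HOME/imagenet-data \
    --job_name='worker' \
    --task_id=$i \
    --ps_hosts='ps0.example.com:2222' \
    --worker_hosts="$WORKERS" &
done

On node 1 the same loop would use --task_id=$((i + 4)).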
I also tried using only 1 worker per node, with each worker using all 4 GPUs. On node 0:
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
On node 1:
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=1 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
But in the end each node used only 1 GPU.
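(GPU utilization per node can be checked during training with, e.g.:

watch -n 1 nvidia-smi

and only one of the 4 GPUs shows a TensorFlow process on it.)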
So what is the exact command I should use? Thanks.