I am using Open MPI 1.6 on a cluster that has 8 nodes, each with 8 cores. I run my application with this command:
/path/to/mpirun --mca btl self,sm,tcp --hostfile $PBS_NODEFILE -np $num_core /path/to/application
I ran some experiments and got the following data:
num nodes | cores per node | total cores | execution time
        1 |              2 |           2 | 8.5 sec
        1 |              4 |           4 | 5.3 sec
        1 |              8 |           8 | 7.1 sec
        2 |              1 |           2 | 11 sec
        2 |              2 |           4 | 9.5 sec
        2 |              4 |           8 | 44 sec   // this is too slow
As you can see, the execution time in the last row (2 nodes, 8 total cores) is much slower than the others. I expected some overhead from using more than one node, but I did not expect such a drastic degradation.
So my question is: are there any Open MPI performance parameters I am missing for running jobs on a cluster across more than one node? I assumed that the --mca btl self,sm,tcp
parameter automatically uses shared memory for communication inside a node and TCP for communication that leaves the node. Do I understand that correctly?
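To verify this, I was thinking of turning up the BTL verbosity so mpirun reports which transport each process actually selects; a sketch, assuming btl_base_verbose is the right parameter and 30 is a reasonable level:

# sketch: print BTL component selection (parameter name/level assumed from the Open MPI docs)
/path/to/mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 --hostfile $PBS_NODEFILE -np $num_core /path/to/application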
I know it is hard to tell without knowing the application, but I am asking about general parameter tuning that should be independent of the application.
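For instance, I wondered whether process binding could matter here; a sketch of what I would try, assuming the --bind-to-core and --report-bindings options are available in the 1.6 series as the man page suggests:

# sketch: pin one rank per core and print the resulting bindings (flags assumed available in 1.6)
/path/to/mpirun --mca btl self,sm,tcp --hostfile $PBS_NODEFILE -np $num_core --bind-to-core --report-bindings /path/to/application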