Background:
I need to run a huge climate-simulation computation involving over 800 [GB] of data (the past 50 years plus 80 years into the future).
For this I'm using RegCM4 on Linux (Ubuntu). The most powerful system we have has an Intel Xeon processor with 20 cores; we also have about 20 smaller, less powerful machines with octa-core Intel i7 processors.
A single system would take more than a month to run the simulations, so I've been trying to set up a cluster with the available resources.
(FYI: RegCM supports parallel processing with MPI.)
Specs:

Computer   Sockets   Cores/socket   Threads/core   CPUs   RAM     Hard drives
node0      1         10             2              20     32 GB   256 GB + 2 TB HDD
node1      1         4              2              8      8 GB    1 TB HDD
node2      1         4              2              8      8 GB    1 TB HDD
-> I use MPICH v3 (I don't remember the exact version number).
And so on... (all nodes other than node0 are the same as node1).
All nodes have 1 Gbps Ethernet cards.
For testing, I set up a small simulation analyzing 6 days of climate. All test runs used the same parameters and model settings.
All nodes boot from their own HDD.
node0 runs Ubuntu 16.04 LTS; the other nodes run Ubuntu 14.04 LTS.
How did I start? I followed the steps given here.
- Connected node1 and node2 with a Cat 6 cable and assigned them static IPs (left node0 aside for now).
- Edited /etc/hosts with the IPs and corresponding names - node1 and node2 as given in the table above.
- Set up password-less SSH login on both machines - success.
- Created a folder in /home/user on node1 (which will be the master in this test), exported it via /etc/exports, mounted this folder over NFS on node2 and edited /etc/fstab on node2 accordingly - success (a rough sketch of these files is shown after the command below).
- Ran my regcm over the cluster using 14 cores across the two machines - success.
- Used iotop, bmon and htop to monitor disk read/write, network traffic and CPU usage respectively.
$ mpirun -np 14 -hosts node0,node1 ./bin/regcmMPI test.in
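For reference, the host/NFS configuration was roughly along the lines of the sketch below; the IP addresses, exported path and options here are placeholders for illustration, not my exact values:

# /etc/hosts on node1 and node2 (placeholder IPs)
192.168.1.1   node1
192.168.1.2   node2

# password-less SSH (repeated in each direction)
$ ssh-keygen -t rsa
$ ssh-copy-id user@node2

# /etc/exports on node1 (the master), sharing the work folder (placeholder path)
/home/user/regcm_run   *(rw,sync,no_subtree_check)

# /etc/fstab entry on node2, mounting the shared folder over NFS
node1:/home/user/regcm_run   /home/user/regcm_run   nfs   defaults   0   0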
Result of this test: faster computation than processing on a single node.
Now I tried the same with node0 (see the specs above).
-> On node0 I am working from the SSD.
-> It works fine, but the problem is the time factor when connected as a cluster.
Here's a summary of the results:
- first, using node0 only - no cluster
$ mpirun -np 20 ./bin/regcmMPI test.in

nodes   cores used   run time reported by regcm (s)   actual time taken (s, approx.)
node0   20           59.58                            60
node0   16           65.35                            66
node0   14           73.33                            74

This is okay.
Now, using the cluster (use the following key to understand the table below):
rt          = CPU run time reported by regcm, in seconds
a-rt        = actual time taken, in seconds (approx.)
LAN         = max LAN speed achieved (receive/send), in MB/s
disk(0 / 1) = max disk write speed at node0 / node1, in MB/s
nodes* cores rt a-rt LAN disk( 0 / 1 )
1,0 16 148 176 100/30 90 / 50
0,1 16 145 146 30/30 6 / 0
1,0 14 116 143 100/25 85 / 75
0,1 14 121 121 20/20 7 / 0
*Note:
1,0 (e.g. for 16 cores) means:
$ mpirun -np 16 -hosts node1,node0 ./bin/regcmMPI test.in
0,1 (e.g. for 16 cores) means:
$ mpirun -np 16 -hosts node0,node1 ./bin/regcmMPI test.in
Actual run time was calculated manually from the start and end times reported by regcm.
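(As a side note, the wall-clock figure can also be captured by wrapping the launch in time; this is just a convenience, not how the numbers above were produced:
$ time mpirun -np 16 -hosts node1,node0 ./bin/regcmMPI test.in
)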
We can see above that LAN usage and disk write speed differed significantly between the two options - 1. passing node1,node0 as hosts, and 2. passing node0,node1 as hosts - note the order.
Also, running on a single node is faster than running on the cluster. Why?
I also ran another set of tests, this time using a hostfile (named hostlist) whose contents were:
node0:16
node1:6
Then I ran the following command:
$ mpirun -np 22 -f hostlist ./bin/regcmMPI test.in
The CPU run time reported was 101 [s], the actual run time was 1 min 42 sec (102 [s]), the LAN speed achieved was around 10-15 [MB/s], and the disk write speed was around 7 [MB/s].
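(To check how the ranks actually get distributed with this hostfile - my assumption is that the slots are filled in the order listed - a quick sanity check is to launch hostname through the same hostfile:
$ mpirun -np 22 -f hostlist hostname | sort | uniq -c
If slots are indeed filled in order, this should report 16 x node0 and 6 x node1.)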
The best result was obtained when I used the same hostfile and ran the code with 20 processes, thus under-subscribing:
$ mpirun -np 20 -f hostlist ./bin/regcmMPI test.in
CPU runtime : 90 [s]
Actual run time : 91 [s]
LAN : 10 [MB/s]
When I reduced the core count from 20 to 18, the run time increased to 102 [s].
I have not yet connected node2 to the system.
Questions:
- Is there a way to achieve faster computation? Am I doing something wrong?
- The computation time on a single machine with 14 cores is faster than on the cluster with 20 or 22 cores. Why is this happening?
- What is the optimum number of cores to use for the best time efficiency?
- How can I achieve the best performance with the available resources?
- Is there a good MPICH usage manual that answers these questions? (I could not find such information.)
- Sometimes using fewer cores gives a faster completion time than using more cores, even though I am not using all available cores and leave 1 or 2 cores per node for the OS and other operations. Why does this happen?