
Background: I need to run a huge climate-simulation computation over more than 800 [GB] of data (for the past 50 years and the future 80 years).

For this, I'm using RegCM4, which runs on Linux; I am using Ubuntu. The most powerful system we have has an Intel Xeon processor with 20 logical CPUs (10 cores, hyper-threaded). We also have almost 20 smaller, less powerful octa-core Intel i7 machines.

On a single system, the simulations would take more than a month.

So, I've been trying to set up computer clusters with available resources.
(FYI: RegCM allows parallel processing with mpi.)

Specs:

Computer socket cores_per_socket threads_per_core CPUs   RAM   Hard_drives 
node0    1      10               2                20   32 GB   256 GB-SSD + 2 TB-HDD
node1    1       4               2                 8    8 GB             1 TB-HDD
node2    1       4               2                 8    8 GB             1 TB-HDD

-> I use MPICH v3 (I don't remember the exact version number).

And so on... (all nodes other than node0 are the same as node1.)
All nodes have Ethernet cards that support 1 Gbps.
For testing, I set up a small simulation job analysing 6 days of climate. All test simulations used the same parameters and model settings.

All nodes boot from their own HDD.
node0 runs on Ubuntu 16.04 LTS.
other nodes run Ubuntu 14.04 LTS.

How I started: I followed the steps given here.

  1. Connected node1 and node2 with a Cat 6 cable and assigned them static IPs (leaving node0 out for now) - edited /etc/hosts with the IPs and the corresponding names, node1 and node2, as given in the table above
  2. Set up password-less SSH login on both - success
  3. Created a folder in /home/user on node1 (the master in this test) and exported it (/etc/exports), mounted this folder over NFS on node2 and edited /etc/fstab on node2 - success
  4. Ran my RegCM over the cluster using 14 cores across both machines - success
  5. Used iotop, bmon and htop to monitor disk read/write, network traffic and CPU usage, respectively.
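Steps 1-3 can be sketched as follows; note that the IP addresses and the shared-folder path are hypothetical placeholders, not values from my actual setup:

```shell
## on node1 and node2: name resolution (hypothetical addresses)
# /etc/hosts
#   192.168.10.1  node1
#   192.168.10.2  node2

## on node1 (the master): password-less SSH to node2
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id user@node2

## on node1: export the shared run directory over NFS
# /etc/exports
#   /home/user/regcm-run  node2(rw,sync,no_subtree_check)
sudo exportfs -ra

## on node2: mount it (and add the matching line to /etc/fstab)
sudo mount -t nfs node1:/home/user/regcm-run /home/user/regcm-run
```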

$ mpirun -np 14 -hosts node0,node1 ./bin/regcmMPI test.in

Result of this test:
Faster computation than single-node processing.


Now I tried the same with node0 (see the computer specs above).

-> On node0 I am working from the SSD.
-> It works fine, but the problem is the time taken when the nodes are connected as a cluster.

Here's the summary of the results. First, using node0 only (no cluster):

$ mpirun -np 20 ./bin/regcmMPI test.in

nodes   no.of_cores_used    run_time_reported_by_regcm_in_sec   actual time taken in sec (approx)
node0   20                  59.58                                60
node0   16                  65.35                                66
node0   14                  73.33                                74

This is okay.

Now, using cluster
(use the following key to read the table below):

rt = CPU run time reported by regcm in sec

a-rt = actual time taken in sec (approx)

LAN = Max LAN speed achieved (Receive/Send) in MBps

disk(0 / 1) = Max Disk write speed at node0 / at node1 in MBps

nodes*  cores   rt      a-rt    LAN     disk(  0 /  1 )
1,0    16       148     176     100/30        90 / 50
0,1    16       145     146      30/30         6 /  0
1,0    14       116     143     100/25        85 / 75
0,1    14       121     121      20/20         7 /  0

*note:

1,0 (eg. for 16 cores) means: $ mpirun -np 16 -hosts node1,node0 ./bin/regcmMPI test.in

0,1 (eg. for 16 cores) means: $ mpirun -np 16 -hosts node0,node1 ./bin/regcmMPI test.in

Actual run time was calculated manually using the start and end times reported by regcm.

We can see above that the LAN usage and the disk write speed were significantly different for the two options - 1. passing node1,node0 as hosts; and 2. passing node0,node1 as hosts ---- note the order.

Also, the run time on a single node is faster than on the cluster. Why?

I also ran another set of tests, this time using a hostfile (named hostlist) whose contents were:

node0:16
node1:6

Then I ran the following command:

$ mpirun -np 22 -f hostlist ./bin/regcmMPI test.in

The CPU run time reported was 101 [s]; the actual run time was 1 min 42 sec (102 [s]); the LAN speed achieved was around 10-15 [MB/s]; the disk write speed was around 7 [MB/s].

The best result was obtained when I used the same hostfile and ran the code with 20 processes, thus under-subscribing:

$ mpirun -np 20 -f hostlist ./bin/regcmMPI test.in

CPU runtime     : 90 [s]
Actual run time : 91 [s]
LAN             : 10 [MB/s]

When I reduced the process count from 20 down to 18, the run time increased to 102 [s].
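To find this sweet spot empirically rather than one run at a time, the process count could be scanned in a loop. This is a dry-run sketch that only prints the commands; on the cluster one would replace echo with the real invocation and time each run:

```shell
# Dry-run scan over candidate process counts; swap 'echo' for the real
# (timed) invocation when running on the actual cluster
for np in 14 16 18 20 22; do
  echo "mpirun -np $np -f hostlist ./bin/regcmMPI test.in"
done
```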

I have not yet connected node2 to the system.


Questions:

  1. Is there a way to achieve faster computation? Am I doing something wrong?
  2. The computation on a single machine with 14 cores is faster than on the cluster with 22 or 20 cores. Why is that happening?
  3. What is the optimum number of cores to use for time efficiency?
  4. How can I achieve the best performance with the available resources?
  5. Is there a good MPICH usage manual that can answer my questions? (I could not find any such info.)
  6. Sometimes using fewer cores gives a faster completion time than using more cores, even though I am not using all available cores, leaving 1 or 2 cores free for the OS and other operations on each node. Why is that happening?
  • There are many questions on one complicated topic; answering them specifically would only be possible with in-depth knowledge about RegCM. It sounds like you might be in academia - I would recommend you contact your regional HPC center. They have much more suitable hardware resources and know how to use them efficiently. – Zulan Dec 01 '17 at 16:10
  • Generally: I strongly advise avoiding heterogeneous systems (e.g. nodes with different OS versions, CPUs). In your setup: ditch node0. – Zulan Dec 01 '17 at 16:12
  • Thank you for your response, @Zulan. Of course, in-depth knowledge about RegCM would be better for answering the question(s). I am running these simulations for my thesis, and I don't think I can get into an HPC center - I have no knowledge of any such center in Nepal. I used node0 because it was the best available system and I certainly wanted to make use of the resources available. I also believed adding another TB of SSD would make writing to disk faster and hence reduce time consumption. – anup Dec 04 '17 at 05:27
  • Besides, I also tried to make answering my questions simpler by eliminating RegCM from the equation. I believe the solutions I'm looking for are more related to the physical layout of the cluster, and specifically to MPI. Thank you for taking your time. – anup Dec 04 '17 at 05:33

1 Answer


While the commented advice above to contact a regional or national HPC centre is fair and worth following, I can imagine how hard it could be to get a sizeable processing quota granted when both deadlines and budget are moving against you.


INTRO:
Simplified answers to questions about a still-hidden complex system:

1:
Is there a way to achieve faster computation?
Yes.
Am I doing something wrong?
Not directly.

2:
The computation on a single machine with 14 cores is faster than on the cluster with 22 or 20 cores. Why is that happening?
You pay more than you get. It is that simple. The NFS - a network-distributed abstraction of a filesystem - is possible, but you pay immense costs for the ease of using it once performance becomes the ultimate target. In general, once the lump sum of all the extra costs paid (data distribution + the high add-on overheads) grows higher than the net effect of splitting the [PARALLEL]-distributable workload over a still-low number of CPU_cores, an actual slowdown instead of a speedup appears. This is the common main suspect (not to mention switching off hyper-threading in the BIOS per se for compute-intensive workloads).
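As a back-of-envelope illustration: the 1200 s single-core compute time and the 40 s of NFS/LAN overhead below are assumed numbers, chosen only to reproduce the order of magnitude reported in the question, not measurements:

```shell
# Toy model: T(N) = T1 / N + overhead; all figures are assumptions
awk 'BEGIN {
  T1 = 1200;                 # hypothetical single-core compute time [s]
  local   = T1 / 20;         # 20 local cores, negligible overhead
  cluster = T1 / 22 + 40;    # 22 cores, plus ~40 s of NFS/LAN overhead (assumed)
  printf "local   20 cores : %.1f s\n", local
  printf "cluster 22 cores : %.1f s\n", cluster
}'
```

Even with more cores, a fixed communication cost alone is enough to make the cluster lose - matching the ~60 s local vs ~90-102 s cluster measurements above.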

3:
What is the optimum number of cores to use for time efficiency?
First identify the biggest observed process bottleneck { CPU | MEMORY | LAN | fileIO | a-poor-algorithm }, and only then seek the best step to improve speed (keep iterating forwards along this { cause: remedy } chain while performance still grows). Never attempt to go in the reverse order.

4:
How can I achieve the best performance with the available resources?
This one is the most interesting and will require more work to be done (ref. below).

5:
Is there a good MPICH usage manual that can answer my questions?
There is no colour of LAN cable that would decide its actual speed and performance, or that would assure its suitability for some specific use - it is the overall system architecture that matters.

6:
Sometimes using fewer cores gives a faster completion time than using more cores, even though I am not using all available cores. Why is that happening?
Ref. item 2 above.


SOLUTION:
What can one always do about this design dilemma?

Before taking any step further, do your best to understand both the original Amdahl's Law and its new overhead-strict re-formulation.

Without mastering this basis, nothing else will help you resolve the performance-hunting dilemma of the fair accounting of both { -costs | +benefits }.
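The overhead-strict form can be played with numerically. In the sketch below the serial fraction s and the overhead model o(N) are assumed toy values, not RegCM measurements:

```shell
# Overhead-strict Amdahl: S(N) = 1 / ( s + (1-s)/N + o(N) )
awk 'BEGIN {
  s = 0.05;                       # assumed serial fraction of the workload
  for (N = 2; N <= 64; N *= 2) {
    o = 0.005 * N;                # assumed overhead growing with process count
    printf "N=%2d  S=%5.2f\n", N, 1 / (s + (1 - s) / N + o);
  }
}'
```

The speedup peaks (here around N = 16) and then falls again as the overhead term outgrows the parallel gain - exactly the behaviour seen in the question's measurements.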

A narrow view:
Prefer tests to guesses (+1 for running tests).
mpich is the tool that distributes your code and handles all the related process management and synchronisation. While weather simulations may enjoy a well-established locality of influence (less inter-process communication and synchronisation is then mandatory, though the actual code decides how much really happens), the costs of the data-related transfers will still dominate (ref. below for the orders of magnitude). If you cannot modify the code, you have to live with it and can only try to change the hardware that mediates the flow (from a 1 Gbps interface to 10 GbE or 40 GbE fabrics, if benchmarking tests support that and the budget permits).

If you can change the code, take a look at sample test cases demonstrating a methodical approach to isolating the root cause of the actual bottleneck, and keep the { cause: remedy, ... } iterations as a method to fix the things that can go better.

A broader view:

How long will it take to read ( N ) blocks from a 0.8 [TB] disk file and send ( N - 1 ) of them across the LAN?

For a rough estimate, let's refresh a few facts about how these things actually work:

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

Processors differ a lot in their principal internal ( NUMA in effect ) architectures:

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

Yet, while these Intel-published details influence any and all performance planning, these figures are neither guaranteed nor constant, as Intel warns in a published remark:

"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS, NUMBERS OF DIMMS, ETC,ETC..YOUR MILEAGE MAY VARY."

I like the table, where the orders of magnitude are visible; someone might prefer a visual form, where colours "shift" the paradigm, based originally on Peter Norvig's post: https://i.stack.imgur.com/a7jWu.png

If my budget, deadlines and the simulation software permitted, I would prefer (performance-wise, due to latency masking and data-locality-based processing) to go without the mpich layer and instead use the maximum number of CPUs and memory-controller channels per CPU.

With a common sense for architecture-optimised processing design, it has been shown that the same results can be computed by pure-[SERIAL] code even ~50x to ~100x faster than the best case of the "original" code, or of code written with silicon-architecture-"naive" programming tools and improper paradigms.

Today's systems are capable of serving vast simulations' needs this way, without spending more than is received.

I hope your conditions will allow you to go smart in this performance-subordinated direction, and to cut the simulation run-times to far less than a month.

  • *"I can imagine, how hard it could get to receive some remarkable processing-quota granted, if both deadlines and budget are moving against you"* - Don't assume that! In the areas where I have worked, it does not involve money, and the effort to get entry-level compute capacity is very reasonable. Spending 1 hour writing a simple proposal for 50k "free" CPU hours that you get within 2 days is totally within the realm of possibility. Of course I can't speak for every HPC center, but usually they want to keep a high utilization and scientific output. – Zulan Dec 02 '17 at 10:22
  • Regarding your answer: I feel it can give a good general idea, but overall it's too general and simplified for this specific problem, with the risk of drifting into speculation. Of course, a better answer is hardly possible without an actual reproducible performance analysis and deep knowledge of RegCM. – Zulan Dec 02 '17 at 10:26
  • Cannot confirm the 1st remark. Yes, there are initiatives for growing HPC capacities. That may create a false illusion of a surplus supply of HPC-grade [CPU.hrs], and that more HPC jobs may now get "spare" [CPU.hrs] quota from (now somewhat aging) previous generations of HPC fabric, as top-priority HPC jobs get placed on the newest toys. But there is also an opposite process of de-commissioning older generations of HPC hardware (grids | clusters) that the HPC centers are not willing to power and pay electricity bills for, once higher-performance processors have stepped in. – user3666197 Dec 02 '17 at 11:33
  • Thank you @user3666197 for your effort. The simplified answers at the top were more helpful. Will turning off hyper-threading help? The program RegCM (whose source, of course, I can't change) was designed to support hyper-threading for parallel processing and even clustering. However, no resources are available to assist in setting up a cluster; they assume we already have network administrators for that job. Regarding the bottleneck, I don't think 1 Gbps Ethernet is one: the max LAN speed achieved was 100 MBps, and that was the maximum value. The rest of the time the LAN speed was limited to 20-30. – anup Dec 04 '17 at 05:41
  • Memory usage was not too high; the 32 GB RAM seemed to stay cool, with under approx. 60% usage. FileIO could be a problem. As I mentioned in the question above (the third tabular entry and the note below it), the disk write speed changed with the order of the nodes that I supplied in the command. I'm not really able to decide on this, since the writing was done to an NVMe SSD (must be M.2). RegCM probably does not have a poor algorithm; it was developed by ICTP and thousands of researchers use it worldwide. I believe making efficient use of multiple cores will solve the problem, but that's just what I think. – anup Dec 04 '17 at 05:51
  • HyperThreading may help in case poor I/O ops permit such a CPU_core process-flow interleaving trick (which may help mask part or all of the I/O latency delays), but always at the known cost of uncontrollably lost CPU_core cache pre-fetched data, which means devastated CPU-bound computing (under)performance. So the nature of the costs vs. effects makes HT rather an ultimate HPC-performance evil :o) – user3666197 Dec 08 '17 at 18:02