
I am running a simulation on a Linux machine with the following specs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:              4
CPU MHz:               3099.902
CPU max MHz:           3700.0000
CPU min MHz:           1000.0000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K

This is the command-line script used to run my solver.

/path/to/meshfree/installation/folder/meshfree_run.sh    # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N  # on N parallel MPI processes

I share the system with another colleague of mine. He uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?

I am a Mechanical Engineer with very little knowledge of parallel computing, so please excuse me if the question is too stupid.

  • Is the question - Am I better off running this process one time, or kicking it off 30 times? This is a very application-specific question and depends on too many variables to answer conclusively. In summary it's a case of 'try it and find out'. – bob dylan Feb 28 '20 at 13:52
  • It's just one job. I need to allocate the right resources to it. So on a 40-core machine, with 10 cores already being used, am I better off running the code on 30 processors? Also considering hyperthreading. – vikingd Feb 28 '20 at 13:58
  • What's the alternative option you suggest? Why can't you run it and find out what's best? – bob dylan Feb 28 '20 at 13:59
  • The alternative would be to use 60 processes, but I am not sure how the processes are split between the processors. It takes around 4 days for one whole simulation and I am running short on time. I am already in the middle of a simulation. – vikingd Feb 28 '20 at 14:09
  • Well, you'd think before kicking off a 4-day-long simulation you'd know the best way to maximize your compute power - usually by using a cut-down version (e.g. a 10% sample). Again - not a question anyone can answer for you, as it's too bespoke. – bob dylan Feb 28 '20 at 14:18

1 Answer


Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."

Salutes to Aachen. Were it not for your remark that you are already in the middle of a simulation, the fastest option would be to pre-configure the computing eco-system so that you:

  • check the full details of your NUMA topology - using lstopo, or lstopo-no-graphics -.ascii, not just lscpu (see the sketch right after this list)
  • launch your jobs with as many MPI-worker processes as possible, each mapped onto a physical CPU-core (best each one exclusively onto its own private core), as these workers carry the core FEM / meshing processing workload
  • if your FH policy does not forbid it, ask the system administrator to introduce CPU-affinity mapping (it protects your in-cache data from eviction and expensive re-fetches): 10 CPUs mapped exclusively for use by your colleague, the said 30 CPUs mapped exclusively for your application runs, and the rest of the listed resources (~40 CPUs) "shared" for use by both, via your respective CPU-affinity masks - a sketch follows right after this list.
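A minimal shell sketch of the topology check and of an affinity-pinned launch - illustrative only: it assumes the hwloc and util-linux/numactl tools are installed, that the agreed split really is CPUs 0-9 for your colleague and 10-39 for you (read the actual logical-CPU-to-core mapping from lstopo first), and that meshfree_run.sh simply forwards the worker count to an underlying MPI launcher:

# inspect the real NUMA / core / cache topology, not just the lscpu summary
lstopo-no-graphics -.ascii
numactl --hardware

# illustrative pinning: keep the solver on 30 dedicated physical cores,
# leaving CPUs 0-9 to the colleague's job (ranges must match the agreed masks)
taskset -c 10-39 /path/to/meshfree/installation/folder/meshfree_run.sh 30

If the MPI library inside the wrapper applies its own binding policy (Open MPI, for example, binds ranks itself and offers mpirun --bind-to core), the pinning may have to be expressed through that launcher instead, so check how meshfree_run.sh actually starts the ranks.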

Q : "Using 30 MPI processes?"

No, that is not how to reason about ASAP processing - use as many CPUs for the workers as possible. Already MPI-parallelised FEM simulations have a high degree of parallelism and a by-nature "narrow" locality (be the problem represented as a sparse-matrix / N-band-matrix solver), so their parallel portion is typically very high compared to other numerical problems - Amdahl's Law explains why.
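For illustration only (the real parallel fraction of this meshfree solver is not known from the question), Amdahl's Law with an assumed parallel fraction p = 0.95 reads:

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad S(20) = \frac{1}{0.05 + \frac{0.95}{20}} \approx 10.3, \qquad S(30) = \frac{1}{0.05 + \frac{0.95}{30}} \approx 12.2

so the extra physical cores still buy a visible speedup, while the remaining serial fraction is what ultimately caps it.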

Sure, there might be academic objections that some slight difference is possible in cases where the communication overheads get marginally reduced by running one worker less, yet the need for brute-force processing rules in FEM / meshed solvers: the communication costs (sending only a small amount of the neighbouring blocks' "boundary"-node state data) are typically far less expensive than the large-scale, FEM-segmented numerical computing part.
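A rough surface-to-volume estimate makes this plausible (a generic domain-decomposition argument, not a measured property of this solver): for a roughly cubic sub-domain of n points per side handled by one worker, the computation grows with the volume while the exchanged boundary data grows with the surface,

\frac{\text{communication}}{\text{computation}} \;\sim\; \frac{6\,n^{2}}{n^{3}} \;=\; \frac{6}{n}

so as long as each worker keeps a reasonably large sub-domain, the boundary-node exchange stays a small fraction of the per-step numerical work.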

htop will show you the actual state (you may notice processes wandering between CPU-cores, due to HT / CPU-core thermal-balancing tricks, which decreases the resulting performance).
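Besides htop, a minimal way to watch which logical CPU each busy process is currently scheduled on (the psr column), assuming a standard procps ps and watch:

# pid, current CPU (psr), CPU usage and command of the busiest processes, refreshed every 2 s
watch -n 2 "ps -eo pid,psr,pcpu,comm --sort=-pcpu | head -n 40"

If the psr values of your MPI ranks keep changing, the affinity pinning discussed above is not in effect.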


And do consult the meshfree support team for their Knowledge Base sources on Best Practices.


Next time, the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads (given business-critical conditions, consider this a risk to smooth BAU, the more so if it impacts your business continuity).

    Thank you. That helps a lot. – vikingd Feb 28 '20 at 15:24
  • **Always welcome,** @vikingd - you may like this https://stackoverflow.com/a/60427809 for both the **performance** impacting details and **interactive graphical tool** (a simulator) of the Amdahl's Law **net-speedups** on the *real-world* `[SERIAL]`-**`[PARALLEL]` workloads** – user3666197 Feb 28 '20 at 15:36