Recently I've been running some molecular dynamics simulations (LAMMPS) on my M1 Mac mini.

For a simple task, I use the command:

lmp_serial -in run.in.npt

I believe this means a single-core (serial) calculation. It took 4 min 45 s of running time. Then I tried running on 4 cores:

mpirun -np 4 lmp_mpi -in run.in.npt

It took 1 min 51 s.

But when I used 8 cores:

mpirun -np 8 lmp_mpi -in run.in.npt

it didn't run faster; it took 3 min 38 s.

For comparison, I then tried 2 and 6 cores as well. The results are summarized here:

1 core  : 4 min 45 s
2 cores : 2 min 55 s
4 cores : 1 min 51 s
6 cores : 4 min 45 s
8 cores : 3 min 38 s

Does anyone know the reason? Is it something related to Open MPI? (I didn't install it myself, so if it wasn't pre-installed on the Mac, it isn't there.)

  • On the Apple official website, the M1 chip is described as follows: 8-core CPU with 4 performance cores and 4 efficiency cores; 8-core GPU; 16-core Neural Engine. – Yilong Pan Jan 04 '21 at 20:18
  • Adding slow cores to an MPI job usually results in decreased overall efficiency, unless it is a job of the "bag of work" type, which adapts itself to the computing environment. LAMMPS is not meant to be run on CPUs with different performance characteristics. – Hristo Iliev Jan 05 '21 at 10:57

1 Answer

LAMMPS distributes the work by performing domain decomposition. For example, with spatial decomposition, the simulation volume is cut into boxes and all particles in a particular box are assigned to some MPI rank for processing. Because integration in molecular dynamics is a globally synchronous operation, each step completes only when all MPI ranks are done. Thus, the time it takes to complete an integration step equals the time it takes the slowest MPI rank to finish its assigned box(es).
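
To make this concrete, here is a minimal sketch in Python (the per-rank timings are made up for illustration, not measured) of why a globally synchronous step runs at the pace of the slowest rank:

# Toy model of one globally synchronous MD step.
# Each entry is the (hypothetical) time an MPI rank needs for its subdomain.
rank_times = [1.0, 1.0, 1.0, 3.0]  # one rank is three times slower

# The step completes only when every rank is done, so its duration is
# the maximum over all ranks, not the average.
step_time = max(rank_times)
print(step_time)  # 3.0 -- the three fast ranks sit idle while waiting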

M1 is an architecture similar to ARM DynamIQ (successor of big.LITTLE), packing both cores that are fast but power hungry (Firestorm cores) and cores that are slow but power efficient (Icestorm cores). The M1 chip in the Mac mini has four of each type. Since macOS does not provide explicit CPU affinity, the MPI library simply launches the given number of ranks and it is up to the OS to schedule them as efficiently as possible on the available cores. With up to four MPI ranks, they all end up on their own Firestorm cores. Once you get past four MPI ranks, some will either land on one of the slow Icestorm cores or will timeshare a fast core with some other rank. In both cases, there will be at least one rank that will be much slower than the rest and hence the overall performance will suffer.
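
As a rough illustration (not a measurement), the observed pattern can be reproduced with a toy model: an even domain split across ranks, four fast cores, and four slow cores. The relative core speeds below are assumed numbers, and each rank beyond the fourth is assumed to land on a slow core rather than timeshare a fast one:

# Toy model of MPI ranks scheduled on 4 fast + 4 slow cores.
# The relative core speeds are assumptions, not measured values.
FAST, SLOW = 1.0, 0.3

def runtime(nranks, work=1.0):
    # Even domain split; the runtime is set by the rank on the slowest core.
    per_rank = work / nranks
    slowest_core = FAST if nranks <= 4 else SLOW
    return per_rank / slowest_core

for n in (1, 2, 4, 6, 8):
    print(n, round(runtime(n), 3))
# Prints 1.0, 0.5, 0.25, 0.556, 0.417: four ranks is the sweet spot and
# both 6 and 8 ranks are slower, matching the shape of the timings above.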

Another thing to note is that modern CPUs vary their core frequencies based on the thermal conditions. If only one core is loaded, it may be boosted way over its nominal frequency. When more cores are loaded, more heat is produced and the cores will not get as high a frequency boost as in the single-core case. Therefore, going from 1 to 2 to 4 cores will not give you a linear speed-up, even if the problem is embarrassingly parallel.
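
Plugging the timings from the question into the usual formulas (speedup = T1/TN, efficiency = speedup/N) shows this sub-linear scaling even in the 1-to-4-core range:

# Speedup and parallel efficiency computed from the timings in the question.
times = {1: 285, 2: 175, 4: 111, 6: 285, 8: 218}  # run times in seconds

t1 = times[1]
for n, t in times.items():
    speedup = t1 / t
    print(f"{n} cores: speedup {speedup:.2f}, efficiency {speedup / n:.2f}")
# 2 cores: speedup 1.63, efficiency 0.81
# 4 cores: speedup 2.57, efficiency 0.64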

Also, keep in mind that the kind of domain decomposition used can have an enormous effect on the parallel performance. For example, with spatial decomposition, if the atoms are not distributed equally among the subdomains, or if they move between subdomains in a non-uniform way, there will be load imbalance and the speedup will again suffer.
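
As a final sketch (with hypothetical atom counts), the cost of such imbalance is easy to quantify: the achievable speedup drops from the ideal N to N divided by the imbalance factor, which is why LAMMPS provides the balance command and fix balance for static and dynamic load balancing:

# Effect of load imbalance under spatial decomposition (hypothetical counts).
atoms_per_rank = [900, 1100, 800, 1200]  # uneven particle distribution

nranks = len(atoms_per_rank)
imbalance = max(atoms_per_rank) / (sum(atoms_per_rank) / nranks)
print(f"imbalance factor: {imbalance:.2f}")  # 1.20

# If the per-step cost is proportional to the atom count, the speedup is
# capped at nranks / imbalance instead of the ideal nranks.
print(f"speedup cap on {nranks} ranks: {nranks / imbalance:.2f}")  # 3.33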

– Hristo Iliev