
I am programming on a Knights Landing node, which has 68 cores with 4 hyperthreads per core, and I am working on a hybrid MPI/OpenMP application. My question is whether the 4 hyperthreads are meant to be used as OpenMP threads, and if so, how I should use them. When I run my program with the following scheme:

export OMP_NUM_THREADS=1
mpirun -np 68 ./app

it runs much faster than when I use the scheme:

export OMP_NUM_THREADS=4
mpirun -np 68 ./app

Maybe the problem is that the threads belonging to a given MPI rank are not placed close to each other, but I don't know how to control that.

In summary, can I use the 4 hyperthreads/core as OpenMP threads?

Thanks.

armando
  • In the vast majority of applications, there would be no use in running so many threads while running an MPI rank on each core. The default for Intel MPI should be to place the threads locally, but it would be more interesting to first verify that your OpenMP shows a gain on a single MPI rank, using 2 or 4 cores, then try likely combinations of numbers of ranks and threads. – tim18 Oct 21 '17 at 18:07
  • The details of how affinity is set vary among MPI implementations, but an MPI targeted toward KNL should include that feature. – tim18 Oct 21 '17 at 21:55
  • I haven't had enough experience with KNL to say whether an application which uses more than 1 thread per core effectively might peak out before all cores are in use. On KNC, MPI might keep a core busy messaging and another running the OS and MPI. – tim18 Oct 22 '17 at 00:03
  • Which MPI library are you using? `mpirun -np 68 grep Cpus_allowed_list /proc/self/status` will tell you how each MPI task is bound. Note these are physical CPUs and they are a bit hard to interpret; you can run `lstopo` in order to understand the topology. On KNL, you should not expect any significant improvement when moving from 2 to 4 threads per core; everything else is application dependent. For example, Intel reported that optimal results on a specific app were achieved with flat MPI and 136 tasks per node (which is 2 threads per core, and is a bit counter-intuitive). – Gilles Gouaillardet Oct 22 '17 at 00:56
  • Shouldn't you use `export OMP_NUM_THREADS=272`, since the number is the total number of threads rather than the number of threads per CPU? According to https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fNUM_005fTHREADS.html – Surt Oct 22 '17 at 15:15

1 Answer


As you're probably using the Intel MPI and OpenMP runtimes, allow me to forward you some links with valuable information for pinning MPI processes and OpenMP threads onto processor cores/threads. Process/thread binding is a must nowadays to achieve high performance: even though the OS tries to do its best, moving a process/thread from one core/thread to another implies that its data has to be moved as well. For that matter, take a look at Running an MPI/OpenMP Program and Environment Variables for Process Pinning. For instance, if you run with 68 MPI ranks, you probably want to place each MPI rank on a different core. You can double-check whether mpirun is honoring your requests by setting the I_MPI_DEBUG environment variable (as described here).
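
As a rough sketch (I haven't tested this on your system, and it assumes the Intel MPI runtime together with an OpenMP 4.0-capable runtime, as covered by the links above), a launch that keeps each rank's 4 OpenMP threads on the hyperthreads of its own core could look like:

export OMP_NUM_THREADS=4
export OMP_PLACES=threads      # one OpenMP place per hardware thread
export OMP_PROC_BIND=close     # keep a rank's threads next to each other
export I_MPI_PIN_DOMAIN=omp    # give each rank a pinning domain of OMP_NUM_THREADS hardware threads
export I_MPI_DEBUG=4           # print the resulting pinning map at startup
mpirun -np 68 ./app

With I_MPI_DEBUG set, the pinning map printed at startup should show each rank confined to the 4 hardware threads of a single core; if it does not, the variables (or their values) may need adjusting for your particular MPI version.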

Harald