I've read article [1] about running Linpack on AMD. As I understand it, the execution strategy is to use one MPI rank per L3 cache with four OpenMP threads each, since each L3 cache is shared by four physical cores. After reading the article, I have three questions that I cannot answer by googling:
1) He is benchmarking a single-CPU system. I assume OpenMPI is generally used to deploy Linpack on a cluster, but is there any performance benefit in using several MPI ranks instead of one rank with more threads? On a multi-socket/shared-memory machine, this should, as far as I understand, not make any difference.
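To make the question concrete, these are the two launch variants I am comparing. The core count of 32 is only my assumption for a part with eight L3 caches of four cores each, and the P x Q grid in HPL.dat would of course have to match the rank count in each case:

# one rank per L3 cache, four threads per rank, as in the article
export OMP_NUM_THREADS=4
mpirun -np 8 --map-by l3cache xhpl

# a single rank driving all cores with OpenMP threads
export OMP_NUM_THREADS=32
mpirun -np 1 xhpl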
2) He runs the benchmark as follows:
export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=4
mpirun -np 8 --map-by l3cache --mca btl self,vader xhpl
My problem is that mpirun's default setting for bind-to is binding to cores. As I understand it, that means each rank is bound to one core. Now, even though the OMP threads are also bound to cores, I cannot see how the four threads per rank would end up on four different cores. Instead, since the rank (the process) is bound to one core, its four OMP threads would all share that same core, which is surely not intended. I have no such CPU to verify my assumption. Am I correct that a --bind-to l3cache setting is missing here to allow the OMP threads to spread over all cores sharing the L3 cache? If not, why not?
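In other words, I would have expected the launch line to look something like this (untested, just my guess at the missing option):

mpirun -np 8 --map-by l3cache --bind-to l3cache --mca btl self,vader xhpl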
3) He states that one should disable SMT when benchmarking. Why? I know that hardware threads may not always increase performance if shared execution units such as the FPUs are saturated, but why would they decrease performance?
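(For what it's worth, I would disable SMT either in the BIOS or, assuming a Linux kernel that exposes the SMT control file, at runtime with

echo off | sudo tee /sys/devices/system/cpu/smt/control

but my question is about the why, not the how.)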
Thank you very much for your help.
Kind regards, Maximilian