PETSc code has no parallel speed-up on 2990WX platform

Question

When I run my code on an old Intel Xeon platform (X5650 @ 2.67 GHz), the parallel efficiency looks good: I get an 80%~95% speed-up each time I double the number of processors. However, when I run the same code on an AMD 2990WX platform, I get no acceleration with any number of threads.

I am confused about why the new AMD platform shows such poor parallel efficiency, and I cannot tell which settings in my code are wrong.

I have a C code based on the PETSc library that solves a very large sparse linear system. The parallel part of my code is provided by PETSc, which uses MPI internally (I only assign the matrix construction tasks to each process and do not add any other communication routines).

Both platforms run CentOS 7, with MPICH 3 as the MPI library and PETSc 3.11. On the Xeon platform BLAS is provided by MKL, while on the AMD platform it is provided by the BLIS library.

While the program was running on the AMD platform, I used top to check processor activity and found that the CPU usage does differ between run settings:

for 32 processes:

/usr/lib64/mpich/bin/mpiexec -n 32 ./${BIN_DIR}/main

for 64 processes:

/usr/lib64/mpich/bin/mpiexec -n 64 ./${BIN_DIR}/main

on the Xeon platform:

/public/software/Petsc/bin/petscmpiexec -n 64 -f mac8 ./${BIN_DIR}/main

with mac8 file:

ic1:8
ic2:8
ic3:8
ic4:8
ic5:8
ic6:8
ic7:8
ic8:8

I have tried 16, 24, 32, 48, 64 and even 128 processes (although the 2990WX only has 64 threads). The code runs with all of these settings, but the speed hardly changes. — Nothingts, Jun 06 '19 at 12:19
There are 2 X5650 CPUs per node and 8 nodes in total. Each X5650 has 2 cores and 4 threads. — Nothingts, Jun 06 '19 at 12:20
The X5650 CPU should have 6C and 12T per CPU (https://ark.intel.com/content/www/us/en/ark/products/47922/intel-xeon-processor-x5650-12m-cache-2-66-ghz-6-40-gt-s-intel-qpi.html), so with an 8 node system you should have 48C / 96T. It could be that you had higher memory bandwidth with the old system, depending on how the memory was configured (if each node had its own RAM). — drescherjm, Jun 06 '19 at 12:31
On the old platform, each node has its own 12G of RAM, while the new platform has 64G of RAM in total. — Nothingts, Jun 06 '19 at 12:44
By the way, the code only uses 38G of memory at most while running. — Nothingts, Jun 06 '19 at 12:47
It's not about the quantity of RAM but how it's connected. In the old system each node has 12GB of RAM that is directly connected to the CPUs of that node. In the new system it's quad channel, possibly split into 2 separate nodes. There will be more contention for RAM bandwidth in the new system. — drescherjm, Jun 06 '19 at 12:47
Thank you very much! If I use RAM with a higher frequency, would it be better? — Nothingts, Jun 12 '19 at 13:24
It would be marginally better. The old system had quite an advantage in memory performance per node even though the memory runs slower. I believe the old system had 24 total channels of DDR3 versus the new system with 4 channels of faster DDR4. — drescherjm, Jun 12 '19 at 13:30
