When I run my code on an old Intel Xeon platform (X5650 @ 2.67 GHz), the parallel efficiency looks good: doubling the number of processors gives an 80%~95% speed-up. However, when I run the same code on an AMD 2990WX platform, I get no acceleration at all, whatever the number of threads.
I am confused about why the new AMD platform shows such poor parallel efficiency, and I cannot tell which settings in my code are wrong.
I have a C code based on the PETSc library that solves a very large sparse linear system. The parallel part of my code is provided by PETSc, which uses MPI internally (I only assign the matrix construction tasks to each process and do not add any other communication routines).
Both platforms run CentOS 7, with MPICH3 as the MPI library and PETSc 3.11. BLAS on the Xeon platform is provided by MKL, while BLAS on the AMD platform is provided by the BLIS library.
While the program is running on the AMD platform, I use top to check the processor activity, and found that the CPU usage actually differs between run settings:
for 32 processes:
/usr/lib64/mpich/bin/mpiexec -n 32 ./${BIN_DIR}/main
for 64 processes:
/usr/lib64/mpich/bin/mpiexec -n 64 ./${BIN_DIR}/main
on the Xeon platform:
/public/software/Petsc/bin/petscmpiexec -n 64 -f mac8 ./${BIN_DIR}/main
with the mac8 file:
ic1:8
ic2:8
ic3:8
ic4:8
ic5:8
ic6:8
ic7:8
ic8:8