I compiled (with the GCC and PGI compilers) and ran a small Fortran/OpenMP program on two different platforms (Haswell- and Skylake-based), just to get a feel for the difference in performance. I don't know how to interpret the results: they are a mystery to me.
Here is the small program (taken from the NVIDIA Developer website and slightly adapted):
PROGRAM main
    use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
    use omp_lib
    implicit none
    real(dp), parameter :: tol = 1.0d-6
    integer,  parameter :: iter_max = 1000
    real(dp), allocatable :: A(:,:), Anew(:,:)
    real(dp) :: error
    real(sp) :: cpu_t0, cpu_t1
    integer :: it0, it1, sys_clock_rate, iter, i, j
    integer :: N, M
    character(len=8) :: arg

    call get_command_argument(1, arg)
    read(arg, *) N  !!! N = 8192 provided from command line
    call get_command_argument(2, arg)
    read(arg, *) M  !!! M = 8192 provided from command line

    allocate( A(N,M), Anew(N,M) )

    ! boundary condition: first row fixed at 1, everything else starts at 0
    A(1,:) = 1
    A(2:N,:) = 0
    Anew(1,:) = 1
    Anew(2:N,:) = 0

    iter = 0
    error = 1

    call cpu_time(cpu_t0)
    call system_clock(it0)
    do while ( (error > tol) .and. (iter < iter_max) )
        error = 0
        ! Jacobi sweep: each interior point becomes the average of its four neighbours
        !$omp parallel do reduction(max: error) private(i)
        do j = 2, M-1
            do i = 2, N-1
                Anew(i,j) = (A(i+1,j)+A(i-1,j)+A(i,j-1)+A(i,j+1)) / 4
                error = max(error, abs(Anew(i,j)-A(i,j)))
            end do
        end do
        !$omp end parallel do
        ! copy the new iterate back into A
        !$omp parallel do private(i)
        do j = 2, M-1
            do i = 2, N-1
                A(i,j) = Anew(i,j)
            end do
        end do
        !$omp end parallel do
        iter = iter + 1
    end do
    call cpu_time(cpu_t1)
    call system_clock(it1, sys_clock_rate)

    write(*,'(a,f8.3,a)') "...cpu time :", cpu_t1-cpu_t0, " s"
    write(*,'(a,f8.3,a)') "...wall time:", real(it1-it0)/real(sys_clock_rate), " s"
END PROGRAM
The two platforms I used are:
- Intel i7-4770 @ 3.40GHz (Haswell), 32 GB RAM / Ubuntu 16.04.2 LTS
- Intel i7-6700 @ 3.40GHz (Skylake), 32 GB RAM / Linux Mint 18.1 (~ Ubuntu 16.04)
On each platform I compiled the Fortran program with
- GCC gfortran 6.2.0
- PGI pgfortran 16.10 community edition
Naturally, I compiled the program independently on each platform (I only moved the .f90 file between machines; I never copied any binaries).
I ran each of the 4 executables (2 per platform) 5 times and collected the wall times in seconds, as printed by the program. (I actually repeated the whole test several times, and the timings below are definitely representative.)
Sequential execution. Program compiled with:
- gfortran -Ofast main.f90 -o gcc-seq
- pgfortran -fast main.f90 -o pgi-seq
Timings (best of 5):
- Haswell > gcc-seq: 150.955, pgi-seq: 165.973
- Skylake > gcc-seq: 277.400, pgi-seq: 121.794
Multithread execution (8 threads). Program compiled with:
- gfortran -Ofast -fopenmp main.f90 -o gcc-omp
- pgfortran -fast -mp=allcores main.f90 -o pgi-omp
Timings (best of 5):
- Haswell > gcc-omp: 153.819, pgi-omp: 151.459
- Skylake > gcc-omp: 113.497, pgi-omp: 107.863
When compiling with OpenMP, I checked the number of threads in the parallel regions with omp_get_num_threads(), and there are indeed always 8 threads, as expected.
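For completeness, the check was done roughly like this (a minimal sketch; the variable name and placement are mine, not part of the program above):

program check_threads
    use omp_lib
    implicit none
    integer :: nthreads

    !$omp parallel
    !$omp single
    nthreads = omp_get_num_threads()   ! size of the team in this parallel region
    !$omp end single
    !$omp end parallel

    print *, "threads in parallel region:", nthreads
end program check_threads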
There are several things I don't get:
- Using the GCC compiler: why does OpenMP give a substantial benefit on Skylake (277 vs 113 s), while on Haswell it gives no benefit at all (150 vs 153 s)? What is happening on Haswell?
- Using the PGI compiler: why does OpenMP give such a small benefit (if any) on both platforms?
- Focusing on the sequential runs, why are there such huge differences in execution time between Haswell and Skylake (especially when the program is compiled with GCC)? And why is the difference still so large, but with the roles of Haswell and Skylake reversed, when OpenMP is enabled?
- Also, when OpenMP is enabled and GCC is used, the cpu time is always much larger than the wall time, as I expect for a multithreaded run; but when PGI is used, the cpu and wall times are always practically identical, even though the program uses multiple threads (see the timer sketch after this list).
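To make that last point concrete, here is a minimal, self-contained sketch of how I compare the two timers around a parallel loop (the busy-work loop and all names here are just illustrative, not part of the program above):

program timers
    use, intrinsic :: iso_fortran_env, only: dp=>real64
    implicit none
    real(dp) :: s, cpu0, cpu1
    integer  :: it0, it1, rate, i

    call cpu_time(cpu0)            ! CPU time (may be summed over all threads)
    call system_clock(it0, rate)   ! wall-clock reference

    s = 0
    !$omp parallel do reduction(+:s)
    do i = 1, 100000000
        s = s + sqrt(real(i, dp))  ! arbitrary busy work
    end do
    !$omp end parallel do

    call cpu_time(cpu1)
    call system_clock(it1)

    print *, "sum      :", s      ! keep the result alive
    print *, "cpu time :", cpu1 - cpu0, " s"
    print *, "wall time:", real(it1-it0, dp) / real(rate, dp), " s"
end program timers

With 8 threads fully busy, I would expect the cpu time reported here to be roughly 8 times the wall time; that is what I see with GCC but not with PGI.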
How can I make some sense out of these results?