
I compiled (with the GCC and PGI compilers) and ran a small Fortran/OpenMP program on two different platforms (Haswell- and Skylake-based), just to get a feel for the performance difference. I do not know how to interpret the results - they are a mystery to me.

Here is the small program (taken from the Nvidia Developer website and slightly adapted).

PROGRAM main

use, intrinsic :: iso_fortran_env, only: sp=>real32, dp=>real64
use omp_lib

implicit none

real(dp), parameter :: tol   = 1.0d-6
integer, parameter :: iter_max = 1000

real(dp), allocatable :: A(:,:), Anew(:,:)
real(dp) :: error
real(sp) :: cpu_t0, cpu_t1
integer :: it0, it1, sys_clock_rate, iter, i, j
integer :: N, M
character(len=8) :: arg

call get_command_argument(1, arg)
read(arg, *) N   !!! N = 8192 provided from command line

call get_command_argument(2, arg)
read(arg, *) M   !!! M = 8192 provided from command line

allocate( A(N,M), Anew(N,M) )

A(1,:) = 1
A(2:N,:) = 0

Anew(1,:) = 1
Anew(2:N,:) = 0

iter = 0
error = 1

call cpu_time(cpu_t0)
call system_clock(it0)

do while ( (error > tol) .and. (iter < iter_max) )

    error = 0

    !$omp parallel do reduction(max: error) private(i)
    do j = 2, M-1
        do i = 2, N-1
            Anew(i,j) = (A(i+1,j)+A(i-1,j)+A(i,j-1)+A(i,j+1)) / 4
            error = max(error, abs(Anew(i,j)-A(i,j)))
        end do
    end do
    !$omp end parallel do

    !$omp parallel do private(i)
    do j = 2, M-1
        do i = 2, N-1
            A(i,j) = Anew(i,j)
        end do
    end do
    !$omp end parallel do

    iter = iter + 1

end do

call cpu_time(cpu_t1)
call system_clock(it1, sys_clock_rate)

write(*,'(a,f8.3,a)') "...cpu time :", cpu_t1-cpu_t0, " s"
write(*,'(a,f8.3,a)') "...wall time:", real(it1 it0)/real(sys_clock_rate), " s"

END PROGRAM

The two platforms I used are:

  • Intel i7-4770 @ 3.40GHz (Haswell), 32 GB RAM / Ubuntu 16.04.2 LTS
  • Intel i7-6700 @ 3.40GHz (Skylake), 32 GB RAM / Linux Mint 18.1 (~ Ubuntu 16.04)

On each platform I compiled the Fortran program with

  • GCC gfortran 6.2.0
  • PGI pgfortran 16.10 community edition

I obviously compiled the program independently on each platform (I only moved the .f90 file; I did not move any binary files).

I ran each of the 4 executables (2 per platform) 5 times, collecting the wall times in seconds (as printed by the program). (Well, I ran the whole test several times, and the timings below are definitely representative.)

  1. Sequential execution. Program compiled with:

    • gfortran -Ofast main.f90 -o gcc-seq
    • pgfortran -fast main.f90 -o pgi-seq

    Timings (best of 5):

    • Haswell > gcc-seq: 150.955, pgi-seq: 165.973
    • Skylake > gcc-seq: 277.400, pgi-seq: 121.794
  2. Multithread execution (8 threads). Program compiled with:

    • gfortran -Ofast -fopenmp main.f90 -o gcc-omp
    • pgfortran -fast -mp=allcores main.f90 -o pgi-omp

    Timings (best of 5):

    • Haswell > gcc-omp: 153.819, pgi-omp: 151.459
    • Skylake > gcc-omp: 113.497, pgi-omp: 107.863

When compiling with OpenMP, I checked the number of threads in the parallel regions with omp_get_num_threads(), and there are actually always 8 threads, as expected.
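
For reference, the check looked more or less like this (a minimal standalone sketch, not the exact code I used):

PROGRAM check_threads

use omp_lib

implicit none

!$omp parallel
!$omp single
!! omp_get_num_threads() reports the size of the team inside a parallel region
write(*,'(a,i0)') "...threads in parallel region: ", omp_get_num_threads()
!$omp end single
!$omp end parallel

END PROGRAM

Built with -fopenmp (gfortran) or -mp (pgfortran), this prints 8 on both of my machines.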

There are several things I don't get:

  • Using the GCC compiler: why does OpenMP bring a substantial benefit on Skylake (277 vs 113 s), while it has no benefit at all on Haswell (150 vs 153 s)? What is happening on Haswell?
  • Using the PGI compiler: why does OpenMP have such a small benefit (if any) on both platforms?
  • Focusing on the sequential runs, why are there such huge differences in execution time between Haswell and Skylake (especially when the program is compiled with GCC)? And why is this difference still so large - but with the roles of Haswell and Skylake reversed! - when OpenMP is enabled?
  • Also, when OpenMP is enabled and GCC is used, the cpu time is always much larger than the wall time (as I expect), but when PGI is used, the cpu and wall times are always the same, even though the program uses multiple threads (see the small timer sketch right after this list).
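
To make the last point concrete, here is a minimal standalone sketch (not part of the program above) of what I understand the two timers to measure; with gfortran, cpu_time() appears to report the CPU time summed over all threads, while system_clock() gives the elapsed wall time:

PROGRAM timer_demo

implicit none

integer, parameter :: dp = kind(1.0d0)
real(dp) :: x
real :: cpu_t0, cpu_t1
integer :: it0, it1, sys_clock_rate, i

call cpu_time(cpu_t0)
call system_clock(it0)

x = 0
!$omp parallel do reduction(+: x)
do i = 1, 100000000
    x = x + sin(real(i, dp))   !! busy work shared among the threads
end do
!$omp end parallel do

call cpu_time(cpu_t1)
call system_clock(it1, sys_clock_rate)

write(*,'(a,f8.3,a)') "...cpu time :", cpu_t1-cpu_t0, " s"
write(*,'(a,f8.3,a)') "...wall time:", real(it1-it0)/real(sys_clock_rate), " s"
write(*,*) x   !! use x so the loop is not optimised away

END PROGRAM

With GCC + OpenMP I would expect the cpu time printed here to be roughly the number of threads times the wall time.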

How can I make some sense out of these results?

  • Are the numbers you report from `system_clock`? Be aware that your computation speed is limited by memory bandwidth, not by CPU core speed. (And as always, `real(8)` is an ugly non-portable code smell.) – Vladimir F Героям слава Mar 10 '17 at 11:14
  • Yes, the timings are those from `system_clock`, which I believe are wall times. I am aware that the speed is mostly limited by bandwidth (and, after all, the two CPUs run at the same frequency), but I do not see how this can explain the results. I totally agree on the ugliness of real(8); I actually used real64 from the intrinsic iso_fortran_env. I just wanted to keep things short here. – undy Mar 10 '17 at 13:00
  • It is always good to show the actual and exact code here. I have seen too many important points omitted by other askers (and it also gives a bad example to beginners to see `real(8)`). The point is that the scaling with the number of cores will be worse because 1. not every core has a separate memory bus, 2. different architectures have different numbers of memory channels with different speeds (not sure about these two), 3. hyperthreading is probably harmful here. – Vladimir F Героям слава Mar 10 '17 at 13:06
  • Concerning the code, you are totally right; I am going to edit the post right now and put in the exact source as it is. The explanation is also interesting, but what about the fact that on Haswell enabling OpenMP is irrelevant (from ~151 to ~154 s), while on Skylake I get a great benefit (from 277 s down to 113 s), at least with GCC? – undy Mar 10 '17 at 13:59
  • Due to the higher memory bandwidth of Skylake (36% more)? I don't know. – Vladimir F Героям слава Mar 10 '17 at 15:30
  • These usually take a while to work through. I would suggest making the inner loop (do i = 2, N-1) into a function and using OpenMP SIMD. Once you get the guts of it working, you could then work on parallelising the outer loop. However, my experience is that I am almost always better off on 1 core if I am memory-bandwidth limited and the vectorised function is as good as it gets; often the parallelism just runs slower. All that j-1 stuff is probably not helpful. I would send both columns A(:,j) and A(:,j-1) to the function as separate arguments (A, B) to see if it improves things. You may want prefetching too. – Holmz Mar 12 '17 at 09:27 (a rough sketch of this suggestion follows below)
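
A rough sketch of the kind of inner-loop kernel the last comment suggests (the name update_column and the exact argument list are made up for illustration; note the stencil actually needs the three columns j-1, j and j+1 of A):

!! Hypothetical helper: update one column of Anew from three columns of A
!! and return the maximum change seen in that column.
subroutine update_column(n, a_jm1, a_j, a_jp1, anew_j, col_err)
    use, intrinsic :: iso_fortran_env, only: dp=>real64
    implicit none
    integer,  intent(in)    :: n
    real(dp), intent(in)    :: a_jm1(n), a_j(n), a_jp1(n)
    real(dp), intent(inout) :: anew_j(n)
    real(dp), intent(out)   :: col_err
    integer :: i

    col_err = 0
    !$omp simd reduction(max: col_err)
    do i = 2, n-1
        anew_j(i) = (a_j(i+1) + a_j(i-1) + a_jm1(i) + a_jp1(i)) / 4
        col_err = max(col_err, abs(anew_j(i) - a_j(i)))
    end do
end subroutine update_column

!! Possible call site inside the j loop of the main program:
!!     call update_column(N, A(:,j-1), A(:,j), A(:,j+1), Anew(:,j), col_err)
!!     error = max(error, col_err)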
