
While benchmarking 'subtracting a vector from a matrix', I noticed that Fortran compilers appear to perform some sort of optimization when I reuse variables/code. It looks like the arrays are being served from cache memory, but I'm not sure. I believe this optimization is causing discrepancies in my benchmark results, and I would like to identify the specific type of optimization and, if possible, turn it off.

For example, in the following code comparing two cases, an additional Case 3 is introduced that is identical to Case 1. However, the reported time for Case 3 is much shorter than that for Case 1.

program main
  implicit none

  integer :: n = 1E7
  real*8, dimension(3) :: a
  real*8, allocatable, dimension(:, :) :: b, c
  real :: start, finish
  integer :: i

  allocate(b(n, 3))
  allocate(c(n, 3))

  call random_number(a)
  call random_number(b)

  ! Case 1: Do loop
  call cpu_time(start)
  do i = 1, 3
    c(:, i) = b(:, i) - a(i)
  enddo
  call cpu_time(finish)
  print*, 'do-loop : ', finish-start

  ! Case 2: Spread
  call cpu_time(start)
  c = b - spread(a, dim=1, ncopies=n)
  call cpu_time(finish)
  print*, 'spread  : ', finish-start

  ! Case 3: Do loop (again)
  call cpu_time(start)
  do i = 1, 3
    c(:, i) = b(:, i) - a(i)
  enddo
  call cpu_time(finish)
  print*, 'do-loop : ', finish-start

end program main

This produces similar results with the Intel and GNU compilers, as shown below. I have tried investigating with flags like -O0 and -qopt-report, but cannot understand why the code behaves this way. Because the arrays are large, `ulimit -s unlimited` might be required (on Linux) to avoid a segmentation fault.

$ ifort reuse.f90 && ./a.out 
 do-loop :   0.2072840    
 spread  :   0.4781271    
 do-loop :   3.6670923E-02

$ gfortran reuse.f90 && ./a.out
 do-loop :   0.232345015    
 spread  :   0.342370987    
 do-loop :    4.52849865E-02

Cibin Joseph
  • The `c` array will only be truly materialized in memory when you first access it. That is controlled by the system; you cannot control it from the compiler. Also, the CPU might not be running at full power at the beginning. Be prepared to always run all tests multiple times. – Vladimir F Героям слава Sep 19 '22 at 12:34
  • To add to @VladimirFГероямслава's comment: if you were initialising `c` at the beginning of the code (`c(:) = 0.0` will do), your two do-loops would likely have similar runtimes. – PierU Sep 19 '22 at 13:04
  • A good optimising compiler could throw away the entirety of the code inside the third loop, since its results are not used. – High Performance Mark Sep 19 '22 at 13:22 (see the sketch after these comments)
  • Ah! Thanks, everyone, that makes sense. @HighPerformanceMark I had my doubts whether that was what was happening here! – Cibin Joseph Sep 19 '22 at 14:35
  • Thanks @PierU, that solves my question. If you make that an answer, I shall accept it. – Cibin Joseph Sep 19 '22 at 14:36
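
To guard against the dead-code elimination mentioned in the comments, a common trick (a sketch, not part of the original post) is to consume each result outside the timed region, so the compiler cannot prove the work is unused:

  ! After each timed case, use the result so the optimizer cannot
  ! discard the computation as dead code. Keep the sum outside the
  ! timed region, since it costs time of its own.
  print *, 'checksum: ', sum(c)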

1 Answer


At least on Linux, the memory allocator uses an "optimistic memory allocation strategy" (or see Why can Fortran allocate such large arrays? for the Fortran side). It assumes that there will be enough memory, assigns the virtual address space, and that is all. The physical memory pages are only assigned when you first access the memory, by writing some values to it (or by trying to read the undefined garbage).

That has two implications:

  1. If you requested too much memory, the `allocate` may still succeed and the program may crash later.

  2. The first access will take more time, because that is when the pages are actually assigned.

To remove the latter problem, initialize the memory first, e.g. `c = 0`.
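
A minimal sketch of that fix against the program in the question (the `c = 0` line is the only addition): touching every element of `c` once before any timing moves the page faults out of the timed regions, so Case 1 and Case 3 should then report similar times.

  allocate(b(n, 3))
  allocate(c(n, 3))

  ! Touch every page of c once, outside the timed regions, so the
  ! operating system assigns the physical pages here rather than
  ! inside the first timed loop.
  c = 0

  call random_number(a)
  call random_number(b)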

There are other reasons why you should disregard the first runs of any test and always run the tests multiple times - not as one long test, but as multiple short runs. For example, the turbo modes of modern CPUs may take some time to kick in.
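
As an illustration of that advice (a sketch against the program in the question; `nrep` is an arbitrary choice, not from the original post), each case can be wrapped in a repetition loop, with the first iteration discarded as warm-up:

  integer, parameter :: nrep = 5
  real :: t(nrep)
  integer :: r

  do r = 1, nrep
    call cpu_time(start)
    do i = 1, 3
      c(:, i) = b(:, i) - a(i)
    enddo
    call cpu_time(finish)
    t(r) = finish - start
  enddo
  ! Discard the first (warm-up) iteration and report the best of the rest.
  print *, 'do-loop (min over warm runs): ', minval(t(2:))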