While benchmarking 'subtracting a vector from a matrix', I noticed that Fortran compilers appear to perform some sort of optimization when I reuse variables or code. It looks as if the arrays are being reused from cache memory, but I'm not sure. I believe this optimization is causing discrepancies in my benchmark results, and I would like to identify the specific type of optimization and, if possible, turn it off.
For example, the following code compares 2 cases and then adds a Case 3 that is identical to Case 1. However, the time reported for Case 3 is much less than that for Case 1.
program main
    implicit none

    integer :: n = 1E7
    real*8, dimension(3) :: a
    real*8, allocatable, dimension(:, :) :: b, c
    real :: start, finish
    integer :: i

    allocate(b(n, 3))
    allocate(c(n, 3))

    call random_number(a)
    call random_number(b)

    ! Case 1: Do loop
    call cpu_time(start)
    do i = 1, 3
        c(:, i) = b(:, i) - a(i)
    enddo
    call cpu_time(finish)
    print*, 'do-loop : ', finish-start

    ! Case 2: Spread
    call cpu_time(start)
    c = b - spread(a, dim=1, ncopies=n)
    call cpu_time(finish)
    print*, 'spread : ', finish-start

    ! Case 3: Do loop (again)
    call cpu_time(start)
    do i = 1, 3
        c(:, i) = b(:, i) - a(i)
    enddo
    call cpu_time(finish)
    print*, 'do-loop : ', finish-start
end program main
This produces similar results with the Intel and GNU compilers, as shown below. I have tried investigating with flags such as -O0 and -qopt-report, but I cannot understand why the code behaves this way. Because the arrays are large, ulimit -s unlimited might be required (on Linux) to avoid a segmentation fault.
$ ifort reuse.f90 && ./a.out
do-loop : 0.2072840
spread : 0.4781271
do-loop : 3.6670923E-02
$ gfortran reuse.f90 && ./a.out
do-loop : 0.232345015
spread : 0.342370987
do-loop : 4.52849865E-02
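One check that might help narrow this down is to write to c once before timing anything and then time the same do-loop twice. The sketch below assumes the extra time in Case 1 comes from the first use of c rather than from the loop itself. That is only a guess and I have not confirmed it. The program name warmup, the pass counter and the final print of c(1, 1) are additions for this test and are not part of the benchmark above.

```fortran
program warmup
    implicit none

    integer, parameter :: n = 10000000
    real*8, dimension(3) :: a
    real*8, allocatable, dimension(:, :) :: b, c
    real :: start, finish
    integer :: i, pass

    allocate(b(n, 3))
    allocate(c(n, 3))

    call random_number(a)
    call random_number(b)

    ! Assumption: writing to c once, outside the timed region,
    ! absorbs any one-time cost of using the array for the first time.
    c = 0.0d0

    ! Time the identical do-loop twice in a row.
    do pass = 1, 2
        call cpu_time(start)
        do i = 1, 3
            c(:, i) = b(:, i) - a(i)
        enddo
        call cpu_time(finish)
        print*, 'do-loop pass ', pass, ': ', finish-start
    enddo

    ! Use the result so the loops cannot be discarded as dead code.
    print*, 'check : ', c(1, 1)
end program warmup
```

If the two passes report similar times here, the difference between Case 1 and Case 3 would point to a first-use effect on c rather than a compiler optimization of the repeated loop. If pass 1 is still much slower, something else is going on.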