
I'm working on a simulation whose bottleneck is computing a large number of complex double-precision matrix exponentials. I'm finding that while Fortran (Expokit) works well for small matrices, for larger ones it performs worse than Matlab or Python.

I have included a model program below that shows similar behavior, though it requires larger matrices to expose the performance difference. The profiler and the source code suggest that Expokit spends most of its time calling zgemm(), so my only guess is a problem with my BLAS installation; otherwise I don't understand why Fortran would be slower than Matlab or Python. I would greatly appreciate any insight into improving the performance of the Fortran matrix exponential code.
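One quick way to test the BLAS suspicion is to inspect which shared libraries the compiled binary actually links against; the reference (netlib) BLAS that ships as the default on many systems is far slower than optimized implementations for larger matrices. A sketch (the source and binary file names here are assumptions; adjust them to your build):

```shell
# Build the test program (file names are hypothetical; zgpadm.f is from Expokit).
gfortran -O2 -fopenmp expokit_test.f90 zgpadm.f -o expokit_test -llapack -lblas

# List the BLAS/LAPACK shared libraries the binary is linked against.
ldd ./expokit_test | grep -iE 'blas|lapack'
```

If the output points at the reference `libblas`, that alone can explain the gap, since Matlab and SciPy typically ship with an optimized BLAS.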

Results for 10000 matrices (times in seconds):

| Matrix size | Matlab | Python | Fortran, Expokit | Fortran, Expokit, OpenMP (8 cores) |
|-------------|--------|--------|------------------|------------------------------------|
| 4x4         | 0.91   | 2.59   | 0.037            | 0.0039                             |
| 8x8         | 0.97   | 2.89   | 0.12             | 0.016                              |
| 30x30       | 2.36   | 9.70   | 4.14             | 0.52                               |
| 60x60       | 5.45   | 35.4   | 30.6             | 3.87                               |
| 80x80       | 8.69   | 72.7   | 74.9             | 9.53                               |

Fortran code:

    module expm_mod
        implicit none
        integer, parameter :: wp = selected_real_kind(15, 307)

    contains

        subroutine expokit_test()
            use omp_lib
            implicit none

            integer, parameter :: size = 80
            complex(wp), parameter :: i = (0._wp, 1._wp)
            integer :: count, a, b
            real(wp) :: wtime
            complex(wp) :: mat_exp(size, size), mat(size, size), val

            val = 1E-8_wp
            mat = 0._wp

            do a = 1, size
                do b = 1, size
                    mat(a, b) = a * b
                end do
            end do

            call omp_set_num_threads(8)
            wtime = omp_get_wtime()
            !$omp parallel do default(private) &
            !$omp& shared(mat, val)
            do count = 1, int(1E4)
                mat_exp = expm_complex(-i * mat * val)
            end do
            !$omp end parallel do
            wtime = omp_get_wtime() - wtime
            write(6, *) 'expm_complex', sngl(wtime)

        end subroutine expokit_test

        function expm_complex(A) result(B)
            ! Calculate matrix exponential of complex matrix A using Expokit.
            ! Being a module procedure gives it the explicit interface an
            ! array-valued function requires.

            complex(wp), dimension(:, :), intent(in) :: A
            complex(wp), dimension(size(A, 1), size(A, 2)) :: B

            integer, parameter :: ideg = 2 ! Pade approximation order; 6 is recommended but 2 appears to be stable
            real(wp), parameter :: t = 1._wp ! ZGPADM expects a double precision time step, not a complex one
            complex(wp), dimension(4 * size(A, 1) * size(A, 2) + ideg + 1) :: wsp
            integer, dimension(size(A, 1)) :: iwsp
            integer :: iexp, ns, iflag, n

            n = size(A, 1)

            call ZGPADM(ideg, n, t, A, n, wsp, size(wsp, 1), iwsp, iexp, ns, iflag)
            B = reshape(wsp(iexp : iexp + n * n - 1), [n, n])

        end function expm_complex

    end module expm_mod

Matlab code:

    size = 80;
    A = zeros(size, size); % preallocate

    for a = 1:size
        for b = 1:size
            A(a, b) = 1E-9 + (a * b);
        end
    end

    tic
    for test = 1:1E4
        t = expm(-1i * A * 1E-8);
    end
    toc

Python code:

    import time

    import numpy as np
    from scipy import linalg as la

    size = 80
    mat = np.ones((size, size))

    for a in range(size):
        for b in range(size):
            mat[a, b] = (a + 1) * (b + 1)

    mat = mat + 1E-9

    start = time.time()
    for loop in range(int(1E4)):
        test = la.expm(-1j * mat * 1E-8)

    end = time.time() - start
    print('time taken', end)
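If the default (reference) BLAS turns out to be the culprit, relinking against an optimized implementation such as OpenBLAS is the usual remedy; a sketch for Ubuntu (package and file names assumed, untested here):

```shell
# Install OpenBLAS (Debian/Ubuntu package name).
sudo apt-get install libopenblas-dev

# Relink the hypothetical test program against OpenBLAS instead of the
# reference BLAS; OpenBLAS also provides the LAPACK routines Expokit needs.
gfortran -O2 -fopenmp expokit_test.f90 zgpadm.f -o expokit_test -lopenblas
```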
Comments:
  • So what kind of BLAS implementation do you use then? – Vladimir F Героям слава Jul 26 '18 at 13:31
  • [Why is MATLAB so fast in matrix multiplication?](https://stackoverflow.com/q/6058139/5211833) – Adriaan Jul 26 '18 at 13:34
  • I'm using gfortran installed via Cygwin with the default LAPACK/BLAS (not sure which implementation), though I have also tested on an Ubuntu dual boot, also with gfortran and the default LAPACK installation, and get similar performance. – Chris Jul 26 '18 at 13:35
  • So install some faster BLAS and test it. It is really hard to argue about the performance of a computer we cannot access. – Vladimir F Героям слава Jul 26 '18 at 13:40
  • Okay, thanks. Do you have any specific recommendation? – Chris Jul 26 '18 at 13:43
  • Fortran is a column-major language. Interchange the do-loops that initialize `mat(a,b)`. Striding through memory with the larger matrix may be blowing out the cache. – evets Jul 26 '18 at 15:09
  • It does not matter too much, just some optimized BLAS: OpenBLAS, MKL, GotoBLAS, ATLAS, any of them. Also fix what evets suggests. – Vladimir F Героям слава Jul 26 '18 at 21:09
  • Looking at the Fortran code again, I think the call to `RESHAPE` with an array section may be slow. I suspect most compilers will make a temporary copy of the array section. – evets Jul 27 '18 at 04:49
  • How would you recommend improving that, @evets? I assume assigning a new variable to the array section would be the same as the compiler doing so. – Chris Jul 27 '18 at 07:39
