
I am trying to calculate something similar to a weighted matrix inner product in Fortran. The current code I am using to compute the inner product is as follows:

! --> In
integer, intent(in) :: ni, nj, nk, nVar
real(kind=8), intent(in), dimension(ni, nj, nk, nVar) :: U1, U2
real(kind=8), intent(in), dimension(ni, nj, nk) :: intW

! --> Out
real(kind=8), intent(out) :: innerProd

! --> Local
integer :: iVar

! --> Computing inner product
innerProd = 0.0d0
do iVar = 1, nVar
    innerProd = innerProd + sum(U1(:,:,:,iVar)*U2(:,:,:,iVar)*intW)
enddo
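
In index notation, what this computes is

    innerProd = sum_v sum_{i,j,k} intW(i,j,k) * U1(i,j,k,v) * U2(i,j,k,v)

i.e. a dot product weighted by intW over the first three dimensions, accumulated over the variable index v.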

But I found that the above code is not very efficient. The same operation can be performed in Python using NumPy as follows:

import os

# --> Preventing numpy from multi-threading
# (these must be set *before* numpy is imported to take effect)
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np

ni = nj = nk = nVar = 130
innerProd = 0.0

# --> Toy matrices
U1 = np.random.random((ni, nj, nk, nVar))
U2 = np.random.random((ni, nj, nk, nVar))
intW = np.random.random((ni, nj, nk))

# --> Reshaping
U1 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
U2 = np.reshape(np.ravel(U2), (ni*nj*nk, nVar))
intW = np.ravel(intW)

# --> Calculating inner product
for iVar in range(nVar):
    innerProd = innerProd + np.dot(U1[:, iVar], U2[:, iVar]*intW)
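
As an aside, the loop and the reshape can be skipped entirely with np.einsum; a minimal sketch, equivalent to the loop above but acting on the original 4-D arrays:

# Contract U1*U2 against intW over all indices (applies before the reshape step)
innerProd = np.einsum('ijkv,ijkv,ijk->', U1, U2, intW)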

The second method using NumPy seems to be far faster than the method using Fortran. For a specific case of ni = nj = nk = nVar = 130, the times taken by the two methods are as follows:

 fortran_time = 25.8641 s
 numpy_time = 6.8924 s

I tried improving my Fortran code with ddot from BLAS as follows:

real(kind=8), external :: ddot

do iVar = 1, nVar
    do k = 1, nk
        do j = 1, nj
            ! note: U2(:,j,k,iVar)*intW(:,j,k) creates a temporary array
            innerProd = innerProd + ddot(ni, U1(:,j,k,iVar), 1, U2(:,j,k,iVar)*intW(:,j,k), 1)
        enddo
    enddo
enddo

But there was no considerable improvement in time. The time taken by the above method for the case of ni = nj = nk = nVar = 130 is ~24 s. (I forgot to mention that I compiled the Fortran code with the '-O2' option to optimize performance.)

Unfortunately, there is no BLAS function for element-wise matrix multiplication in Fortran. And I don't want to use reshape in Fortran because, unlike in Python, reshaping in Fortran copies my array to a new array, leading to more RAM usage.
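
For reference, modern Fortran can produce a flattened view without a copy via a pointer remap from iso_c_binding; a minimal sketch, assuming the array is contiguous and carries the target attribute:

use iso_c_binding

real(kind=8), target :: U1(ni, nj, nk, nVar)
real(kind=8), pointer :: U1flat(:)

! View the 4-D array as 1-D without copying any data
call c_f_pointer(c_loc(U1), U1flat, [ni*nj*nk*nVar])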

Is there any way to speed up the performance in Fortran so as to get close to the performance of Numpy?

  • Note that kind=8 is non-portable and ugly. – Vladimir F Героям слава Apr 08 '18 at 13:40
  • Are you positive that numpy uses a single core? – Andras Deak -- Слава Україні Apr 08 '18 at 13:45
  • @AndrasDeak Yes, I am pretty sure numpy is using only a single core. I checked the core usage using htop (see the sketch after these comments for one way to verify this). I was pretty confident that NumPy was faster because of multi-threading, but to my surprise I was wrong. – AdhityaRavi Apr 08 '18 at 13:50
  • I wonder if replicating your weight array to the full size and getting rid of the loop altogether would help. Can you try that? (I left this comment, then thought I was confused and deleted it, but now I again think this would actually correspond to your inner product; but I'm still unsure). – Andras Deak -- Слава Україні Apr 08 '18 at 13:57
  • @AndrasDeak Yes, that could be an option. But I think the `sum` function and direct matrix multiplication in Fortran are slowing down the performance. Also, aren't loops in Fortran supposed to be faster than loops in Python? This may be a very dumb question as I am very new to Fortran, but it's something I learned from my colleagues using Fortran. – AdhityaRavi Apr 08 '18 at 14:11
  • one trick you can play is to pass the multi dimensional arrays to an *external* routine with the dummy arguments 1-d. (Effectively flattening or reshaping the array without making a copy.) Then you can use `dot_product`, or a single `do` loop. – agentp Apr 08 '18 at 14:16
  • _Native_ Python loops `for val in iterable: ...` are slow. Numpy uses compiled C code under the hood, so at its best it's as fast as a compiled C or Fortran alternative. Actually, I'm pretty sure `numpy.dot` does use some BLAS under the hood, as the thread-count switches also exemplify. Can't you skip the reshape and just pass your arrays to the appropriate BLAS routine, which uses assumed shape inside? Also, I don't know if your call to `sum` with slices leads to the creation of temporary arrays, which could explain the speed difference. – Andras Deak -- Слава Україні Apr 08 '18 at 14:17
  • @AndrasDeak Yes, I thought the same. I am not able to find out which BLAS function `numpy.dot` exactly uses. I tried manipulating `dgemm` and `ddot` to calculate the inner product, but it was not improving the performance in any way. I will edit my question with the `ddot` script that I tried. – AdhityaRavi Apr 08 '18 at 14:28
  • @agentp I am really sorry. But I don't follow your suggestion. Can you please elaborate a bit? – AdhityaRavi Apr 08 '18 at 14:29
  • actually cant you do `ddot(ni*nj*nk, U1(1,1,1,iVar), 1, U2(:,:,:,iVar)*intW, 1)` ? – agentp Apr 08 '18 at 14:42
  • @agentp I tried it. Unfortunately, the time taken is still around `~23-24s`. – AdhityaRavi Apr 08 '18 at 14:59
  • I have tried your code, and the loop finishes in 0.6s for me. How are you timing your problem? – kvantour Apr 09 '18 at 13:04
  • Hi @AdhityaRavi, did you get to the bottom of this? I'm also getting 3x slower behaviour in Fortran (gfortran 5.4.0) vs Numpy on an A' A multiplication with a 2400 by 300 matrix. Tried `DGEMM`, `matmul` and optimised loops in Fortran to no effect. – Ed Smith Jan 27 '19 at 22:29
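
For reference, a minimal sketch of how one might confirm which BLAS backend NumPy uses and how many threads it runs with (threadpoolctl is a third-party package and an assumption here):

import numpy as np

# Show which BLAS/LAPACK libraries numpy was built against
np.show_config()

# Inspect the active thread pools (requires `pip install threadpoolctl`)
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool['internal_api'], pool.get('num_threads'))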

3 Answers


You may not be timing what you think you are timing. Here's a complete Fortran example:

program test                                                        
    use iso_fortran_env, r8 => real64                               
    implicit none                                                   

    integer, parameter :: ni = 130, nj = 130, nk = 130, nvar = 130  
    real(r8), allocatable :: u1(:,:,:,:), u2(:,:,:,:), w(:,:,:)     
    real(r8) :: sum, t0, t1                                         
    integer :: i,j,k,n                                              

    call cpu_time(t0)                                               
    allocate(u1(ni,nj,nk,nvar))                                     
    allocate(u2(ni,nj,nk,nvar))                                     
    allocate(w(ni,nj,nk))                                           
    call cpu_time(t1)                                               
    write(*,'("allocation time(s):",es15.5)') t1-t0                 

    call cpu_time(t0)                                               
    call random_seed()                                              
    call random_number(u1)                                          
    call random_number(u2)                                          
    call random_number(w)                                           
    call cpu_time(t1)                                               
    write(*,'("random init time (s):",es15.5)') t1-t0               

    sum = 0.0_r8                                                    
    call cpu_time(t0)                                               
    do n = 1, nvar                                                  
        do k = 1, nk                                                
            do j = 1, nj                                            
                do i = 1, ni                                        
                    sum = sum + u1(i,j,k,n)*u2(i,j,k,n)*w(i,j,k)    
                end do                                              
            end do                                                  
        end do                                                      
    end do                                                          
    call cpu_time(t1)                                               
    write(*,'("Sum:",es15.5," time(s):",es15.5)') sum, t1-t0        

end program

And the output:

$ gfortran -O2 -o inner_product inner_product.f90            
$ time ./inner_product 
allocation time(s):    3.00000E-05
random init time (s):    5.73293E+00
Sum:    3.57050E+07 time(s):    5.69066E-01

real    0m6.465s
user    0m4.634s
sys 0m1.798s

Computing the inner product is less than 10% of the runtime in this Fortran code. How and what you are timing is very important. Are you sure you are timing the same things in the Fortran and Python versions? Are you sure you are only timing the inner-product calculation?
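
For a like-for-like comparison, here is a minimal sketch of timing only the inner-product loop on the NumPy side, assuming the arrays from the question are already set up:

import time

t0 = time.perf_counter()
innerProd = 0.0
for iVar in range(nVar):
    innerProd += np.dot(U1[:, iVar], U2[:, iVar]*intW)
t1 = time.perf_counter()
print("inner product time (s):", t1 - t0)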

ptb

This avoids making any copy. (Note that the BLAS ddot approach still needs to make a copy for the element-wise product.)

subroutine dot3(n, a, b, c, result)
    implicit none
    real(kind=..) :: a(*), b(*), c(*), result
    integer :: i, n
    result = 0
    do i = 1, n
        result = result + a(i)*b(i)*c(i)
    enddo
end

dot3 is external, meaning not in a module/contains construct. The kind should obviously match the declaration in the main program.

In the main code:

innerProd = 0
do iVar = 1, nVar
    call dot3(ni*nj*nk, U1(1,1,1,iVar), U2(1,1,1,iVar), intW, result)
    innerProd = innerProd + result
enddo
agentp
  • Yes, I tried this approach just now. But still no improvement in the run time. It's still `~24-25 s`. I am surprised that all the methods I have tried in Fortran have almost the same run time. – AdhityaRavi Apr 08 '18 at 15:26
  • @AdhityaRavi I wouldn't be too surprised by that. Near the metal you just need to loop over every item in memory, multiply three doubles, and sum up the result; there's not much room for improvement. What we can do is try to minimize overheads, such as from creating temporary arrays. It's only in higher-level languages that huge differences can arise depending on your approach, e.g. a native python loop vs numpy magic. But I'd still suspect multicore shenanigans if the factor of 4 difference doesn't want to go away whatever you do... – Andras Deak -- Слава Україні Apr 08 '18 at 15:28
  • @AndrasDeak I guess you are correct. I am also getting a feeling that numpy is multi-threading somehow as I can find no other explanation for it being 4 to 5 times faster. I tried to suppress this using `os.environ['OPENBLAS_NUM_THREADS'] = '1'` and `os.environ['MKL_NUM_THREADS'] = '1'`. But, I am not sure how else to make sure that numpy is working with a single core. – AdhityaRavi Apr 08 '18 at 15:42
  • @AdhityaRavi well as you know the envvar that affects this depends on the blas library being used by numpy, so it might be something else, such as `OMP_NUM_THREADS` I think. The CPU load you mentioned earlier should be somewhat definitive though. Anyway, you can always check by removing the thread limit switches from the numpy version, and seeing if it runs n times faster (and if there's a corresponding increase in CPU load). – Andras Deak -- Слава Україні Apr 08 '18 at 15:46
  • Note that whether the procedure is external, internal, module or whatever makes no difference to the way sequence association works for arguments. – IanH Apr 08 '18 at 19:59
  • Very good. I was thinking the compiler would complain about the shape mismatch if it had an interface. – agentp Apr 09 '18 at 23:20

I had the same observation comparing Numpy and Fortran code.

The difference turns out to be the version of BLAS: I found that using DGEMM from Netlib is similar to looping, and about three times slower than OpenBLAS (see profiles in this answer).

The most surprising thing for me was that OpenBLAS provides code which is so much faster than just compiling a Fortran triple nested loop. It seems this is the whole point of GotoBLAS, which was handwritten in assembly code for the processor architecture.

Even timing the right thing, ordering loops correctly, avoiding copies and using every optimising flag (in gfortran), the performance is still about three times slower than OpenBLAS. I've not tried ifort or pgi, but I wonder if this explains the upvoted comment by @kvantour "loop finishes in 0.6s for me" (note that the intrinsic matmul is replaced by BLAS in some implementations).
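
For reference, a sketch of how one might link against OpenBLAS with gfortran (assuming OpenBLAS is installed); gfortran's -fexternal-blas flag additionally dispatches the intrinsic matmul to the external BLAS for sufficiently large arrays:

$ gfortran -O3 -fexternal-blas inner_product.f90 -lopenblas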

Ed Smith
  • I am not working on the project for which I posted this question anymore, but this sounds like a very reasonable explanation. Thank you for your answer; I will forward this thread to my colleagues who are still using the slow Fortran code to check if this is what happened in my case too, so that I can have some closure ;). – AdhityaRavi Feb 02 '19 at 16:46