I am trying to calculate something similar to a weighted matrix inner product in Fortran. The current script that I am using for calculating the inner product is as follows
! --> In
real(kind=8), intent(in), dimension(ni, nj, nk, nVar) :: U1, U2
real(kind=8), intent(in), dimension(ni, nj, nk) :: intW
! --> Out
real(kind=8), intent(out) :: innerProd
! --> Local
integer :: ni, nj, nk, nVar, iVar
! --> Computing inner product
do iVar = 1, nVar
innerProd = innerProd + sum(U1(:,:,:,iVar)*U2(:,:,:,iVar)*intW)
enddo
But I found that the above script that I am currently using is not very efficient. The same operation can be performed in Python using NumPy as follows,
import numpy as np
import os
# --> Preventing numpy from multi-threading
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
innerProd = 0
# --> Toy matrices
U1 = np.random.random((ni,nj,nk,nVar))
U2 = np.random.random((ni,nj,nk,nVar))
intW = np.random.random((ni,nj,nk))
# --> Reshaping
U1 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
U2 = np.reshape(np.ravel(U1), (ni*nj*nk, nVar))
intW = np.reshape(np.ravel(intW), (ni*nj*nk))
# --> Calculating inner product
for iVar in range(nVar):
innerProd = innerProd + np.dot(U1[:, iVar], U2[:, iVar]*intW)
The second method using Numpy seems to be far more faster than the method using Fortran. For a specific case of ni = nj = nk = nVar = 130
, the time taken by the two methods are as follows
fortran_time = 25.8641 s
numpy_time = 6.8924 s
I tried improving my Fortran code with ddot
from BLAS as follows,
do iVar = 1, nVar
do k = 1, nk
do j = 1, nj
innerProd = innerProd + ddot(ni, U1(:,j,k,iVar), 1, U2(:,j,k,iVar)*intW(:,j,k), 1)
enddo
enddo
enddo
But there was no considerable improvement in time. The time taken by the above method for the case of ni = nj = nk = nVar = 130
is ~24s
. (I forgot to mention that I compiled the Fortran code with '-O2' option for optimizing the performance).
Unfortunately, there is no BLAS function for element-wise matrix multiplication in Fortran. And I don't want to use reshape in Fortran because unlike python reshaping in Fortran will lead to copying my array to a new array leading to more RAM usage.
Is there any way to speed up the performance in Fortran so as to get close to the performance of Numpy?