
This is a follow-up to this question.

I have an array D(:,:,:) of size NxMxM. For the problem I am considering now, typically M=400 and N=600000 (I reshaped the array to give the largest size to the first index).

Therefore, for each value l of the first index, D(l,:,:) is an MxM matrix expressed in a certain basis. I need to perform a change of basis on each of these matrices using a basis set vec(:,:), of size MxM, so as to obtain the matrices D_mod(l,:,:).

I think that the easiest way to do it is with:

     D_mod = 0.0d0

     do b = 1, M
       do a = 1, M
         do nu = 1, M
           do mu = 1, M
             D_mod(:,mu,nu) = D_mod(:,mu,nu) + vec(mu,a)*vec(nu,b)*D(:,a,b)
           end do
         end do
       end do
     end do

Is there a way to improve the speed of this calculation (possibly using the LAPACK/BLAS libraries)?

I was considering this approach: reshaping D into an N x M^2 matrix D_re; taking the tensor product vec(:,:) x vec(:,:) and reshaping it to obtain an M^2 x M^2 matrix vecsq_re(:,:) (this motivates this question); and finally doing the matrix product of these two matrices with zgemm. However, I am not sure this is a good strategy.
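To make that concrete, here is a rough sketch of what I have in mind (the names D_re, D_mod_re and vecsq_re, the column-major pairing (a,b) -> a + (b-1)*M, and the complex double-precision type are just my illustration, since I mention zgemm):

    ! Sketch only: names and the (a,b) -> a + (b-1)*M pairing are illustrative.
    ! Note that vecsq_re alone is M^2 x M^2 (160000 x 160000 for M = 400).
    complex(kind(0.0d0)), allocatable :: D_re(:,:), D_mod_re(:,:), vecsq_re(:,:)
    complex(kind(0.0d0)), parameter :: cone = (1.0d0, 0.0d0), czero = (0.0d0, 0.0d0)
    integer :: a, b, mu, nu

    allocate(D_re(N,M*M), D_mod_re(N,M*M), vecsq_re(M*M,M*M))

    ! Column-major reshape: D_re(l, a+(b-1)*M) = D(l,a,b)
    D_re = reshape(D, [N, M*M])

    ! vecsq_re((mu,nu),(a,b)) = vec(mu,a)*vec(nu,b)
    do b = 1, M
      do a = 1, M
        do nu = 1, M
          do mu = 1, M
            vecsq_re(mu+(nu-1)*M, a+(b-1)*M) = vec(mu,a)*vec(nu,b)
          end do
        end do
      end do
    end do

    ! D_mod_re = D_re * transpose(vecsq_re), i.e. (N x M^2) times (M^2 x M^2)
    call zgemm('N', 'T', N, M*M, M*M, cone, D_re, N, vecsq_re, M*M, czero, D_mod_re, N)

    D_mod = reshape(D_mod_re, [N, M, M])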

EDIT: I am sorry, I wrote the question too quickly and too late. The size N can be up to 600000, yes, but I usually adopt strategies to reduce it by at least a factor of 10. The code is supposed to run on nodes with 100 GB of memory.

Gippo
  • Assuming double precision, a 600000*400*400 array will take up 715.25 Gbyte. Do you have a computer with that much memory? I suspect you are going to have to find a cluster and code it in ScaLAPACK - which is not a trivial undertaking. – Ian Bush Jan 25 '22 at 08:46
  • Actually if you distribute over the first dimension you don't need scalapack, it is perfectly parallel, which makes it easier. But I still think you will need a distributed memory solution. – Ian Bush Jan 25 '22 at 08:49
  • Thinking about it, as long as other parts of the calculation don't need all of D all of the time, you could calculate, for example, D( 1:600, :, : ), then D( 601:1200, :, : ) and so on, and only require ~0.75 Gbyte, which is obviously doable. – Ian Bush Jan 25 '22 at 09:57
  • @HighPerformanceMark I am running on a cluster, so I don't think that the job is done :) – Gippo Jan 25 '22 at 11:40
  • Well then you should be planning to parallelise this job. You might care to take the advice in the answer you've already been given and parallelise the approach outlined. Since it's an embarrassingly parallel job you've got a few choices of parallelisation, perhaps use co-arrays? – High Performance Mark Jan 25 '22 at 11:43
  • Yes, you are right, parallelization should be done. It will require some work, though (this is part of a code which already works in parallel on another variable). – Gippo Jan 25 '22 at 12:20
  • Mine wasn't meant to be snippy - just an observation :) But I do have to say that if it "works in parallel on another variable", that suggests some form of hierarchical parallelism might work well, with multiple levels of MPI communicators. But the devil is very, very much in the detail - to parallelize a code well on distributed memory machines you have to consider the whole calculation, not just a little bit at a time. – Ian Bush Jan 25 '22 at 14:10

1 Answer


As @IanBush has said, your D array is enormous, and you're likely to need some kind of high-memory machine or cluster of machines to process it all at once. However, you probably don't need to process it all at once.

Before we get to that, let's first imagine you don't have any memory issues. For the operation you have described, D looks like an array of N matrices, each of size M*M. You say you have "reshaped the array in order to give the biggest size to the first entry", but for this problem this is the exact opposite of what you want. Fortran is a column-major language, so iterating across the first index of an array is much faster than iterating across the last. In practice, this means that an example triple-loop like

do i=1,N
  do j=1,M
    do k=1,M
      D(i,j,k) = D(i,j,k) +1
    enddo
  enddo
enddo

will run much slower [1] than the re-ordered triple-loop

do k=1,M
  do j=1,M
    do i=1,N
      D(i,j,k) = D(i,j,k) +1
    enddo
  enddo
enddo

and so you can immediately speed everything up by transposing D and D_mod from N*M*M arrays to M*M*N arrays and rearranging your loops. You can speed things up further by replacing your manually-written matrix multiplication with matmul (or BLAS/LAPACK), to give

do i=1,N
  D_mod(:,:,i) = matmul(matmul(vec , D(:,:,i)),transpose(vec))
enddo
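If you want to call BLAS directly instead of matmul, a minimal sketch of the same per-slice transform might look like the following. It assumes, purely for illustration, complex double-precision data (the question mentions zgemm; use dgemm for real(8) data), and the work array tmp is mine:

    ! Sketch: same transform as the matmul loop above, one slice at a time.
    ! The complex(kind(0.0d0)) type and the tmp work array are assumptions.
    complex(kind(0.0d0)), allocatable :: tmp(:,:)
    complex(kind(0.0d0)), parameter :: cone = (1.0d0, 0.0d0), czero = (0.0d0, 0.0d0)
    integer :: i

    allocate(tmp(M,M))
    do i = 1, N
      ! tmp = vec * D(:,:,i)
      call zgemm('N', 'N', M, M, M, cone, vec, M, D(:,:,i), M, czero, tmp, M)
      ! D_mod(:,:,i) = tmp * transpose(vec)
      call zgemm('N', 'T', M, M, M, cone, tmp, M, vec, M, czero, D_mod(:,:,i), M)
    enddo
    deallocate(tmp)

Since each slice is independent, the loop over i is also trivially parallelisable (e.g. with a threaded BLAS, OpenMP, or by distributing slices across nodes, as discussed in the comments).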

Now that you're doing matrix multiplication one matrix at a time, you can also find a solution for your memory usage problems: instead of loading everything into memory and trying to do everything at once, just load one D matrix at a time into an M*M array, calculate the corresponding entry of D_mod, and write it out to disk before loading the next matrix.
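A minimal sketch of that out-of-core loop, assuming (purely for illustration) that the N matrices are stored consecutively in an unformatted stream file called D.bin and that the data is complex double precision, might be:

    ! Sketch: process one M*M matrix at a time from disk. The file names,
    ! the stream-access layout and the complex(kind(0.0d0)) type are assumptions.
    complex(kind(0.0d0)), allocatable :: Dslice(:,:), Dmod_slice(:,:)
    integer :: i

    allocate(Dslice(M,M), Dmod_slice(M,M))
    open(unit=11, file='D.bin',     form='unformatted', access='stream', action='read')
    open(unit=12, file='D_mod.bin', form='unformatted', access='stream', action='write')
    do i = 1, N
      read(11) Dslice
      Dmod_slice = matmul(matmul(vec, Dslice), transpose(vec))
      write(12) Dmod_slice
    enddo
    close(11)
    close(12)

Reading and transforming a chunk of a few hundred matrices at a time (as suggested in the comments) works the same way and reduces the number of I/O calls.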


[1] If your compiler doesn't just optimise the loop order.

veryreverie
  • Just a question. I know that Fortran is column-major. Isn't the statement `D_mod(:,mu,nu)=D_mod(:,mu,nu)+vec(mu,a)*vec(nu,b)*D(:,a,b)` effectively a loop over the first index? – Gippo Jan 25 '22 at 11:44
  • Sure, you're accessing `D` and `D_mod` efficiently, but at the cost of not being able to optimise the matrix multiplication or separate out the work into smaller chunks. Note that your loop is `O(N*M^4)`, whereas the matmul is `O(N*M^3)`. – veryreverie Jan 25 '22 at 11:59
  • [This answer](https://stackoverflow.com/a/16699282/8344060) on cache-friendly code explains why the loop order affects the speed. – kvantour Jan 27 '22 at 08:54