Efficient tensor contractions by BLAS

Question

C[a,d,e] = A[a,b,c] * B[d,b,c,e] The topic is relate to BLAS tensor contractions for two indexes together , but more complicated.

I can only use do loop to call gemm for many times. The data structure of C and B is not contiguous in current implementation.

I want to find a more efficient way to compute C[a,d,e] = A[a,b,c] * B[d,b,c,e]. Hopefully, call gemm only one time.

Program double_contractions_blas

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64

  Implicit None

  Real( wp ), Dimension( :, :, : ), Allocatable :: a
  Real( wp ), Dimension( :, :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c
  Real( wp ), Dimension( :, :    ), Allocatable :: d
  Real( wp ), Dimension( :, :    ), Allocatable :: e

  Integer :: na, nb, nc, nd, ne, nf, ng
  Integer :: i

  Integer( li ) :: start, finish, rate

  Write( *, * ) 'na, nb, nc, nd, ne ?'
  Read( *, * ) na, nb, nc, nd, ne

  allocate( a( na, nb, nc ) )
  allocate( b( nd, nb, nc, ne ) )
  allocate( c( na, nd, ne ) )

  nf = nd * ne
  ng = nb * nc
  Call Random_number( a )
  Call Random_number( b )

  Call System_clock( start, rate )
  Do i = 1, nd
    Call dgemm( 'N', 'N', na, ne, ng, 1.0_wp, a , Size( a , Dim = 1 ), &
                                              b(i,:,:,:) , ng, &
                                              0.0_wp, c(:,i,:), Size( c, Dim = 1 ) )
  end do

  Call System_clock( finish, rate )

  write( *, * ) c

  end program double_contractions_blas

I would explicitly write all the loops (i.e. replacing `dgemm()` by the equivalent triple loop in your code), then try rearranging the loops to make the best possible use of the cache. That said, is your current code mathematically correct? You are multiplying `a` by `b(i,:,:,:)`, which are 3-dimentional arrays, while `dgemm()` is made for 2-dimentional arrays. — PierU, Nov 21 '22 at 09:11
I will try your triple loop, but I cannot expect that will be faster. I am sure that I use dgemm() correctly, please have a look athttps://stackoverflow.com/questions/74503623/blas-tensor-contractions-for-two-indexes-together?noredirect=1#comment131518351_74503623 @Peter — Mao Yang, Nov 21 '22 at 09:49
Simply replacing `dgemm()` by the equivalent triple loops won't make the code faster in itself. But then you will be able to possibly rearrange the loop ordering to work on more local pieces of data. At the moment your outer loop index `i` correspond to the first dimension of `b`, which is the worst possible situation. — PierU, Nov 21 '22 at 10:37

Efficient tensor contractions by BLAS

0 Answers0