1

I tried to write the python code in Optimizing array additions and multiplications with transposes in Fortran to see if I can achieve any speed up (-O3 helps a lot; the approach in Ian Bush's answer in Transposition of a matrix by multithread in Fortran, seems too complicated to me). E.g.,

0.1 * A(l1,l2,l3,l4) + 0.2*A(l1,l2,l4,l3) + 0.3 * A(l1,l3,l2,l4)+...

If I tried to extend from

Program transpose
  
    integer, parameter :: dp = selected_real_kind(15, 307)

    real(dp), dimension(:, :, :, :), allocatable :: a, b
    Integer :: n1, n2, n3, n4, n, m_iter
    Integer :: l1, l2, l3, l4  
    Integer(8) :: start, finish, rate
    real(dp) :: sum_time
    
    Write(*, *) 'n1, n2, n3, n4?'
    Read(*, *) n1, n2, n3, n4

    Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
    Allocate( b ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
    
    call random_init(.true., .false.)
    Call Random_number( a )
    
    m_iter = 100
    b = 0.0_dp 
    Call system_clock( start, rate )
    do n = 1, m_iter  
      do l4 = 1, n4
        do l3 = 1, n3     
          do l2 = 1, n2
            do l1 = 1, n1
              b(l1,l2,l3,l4) = 0.1_dp*a(l1,l2,l3,l4) + 0.2_dp*a(l1,l2,l4,l3)
            end do
          end do                    
        end do
      end do        
    end do
    Call system_clock( finish, rate )
    sum_time =  real( finish - start, dp ) / rate  

    write (*,*) 'all loop', sum_time/m_iter 
    print *, b(1,1,1,1)

  End 

(I tried reshape, slower than nested loops)

Is there any simple way to include A(l1,l3,l2,l4), A(l1,l3,l4,l2) etc? I can use Python to generate a strings to include all of them with \ for changing lines.

A potential complexity is, if there is a term 0.0 * A(l4,l3,l2,l1), and I would like to skip it, generate a string from python is complicated. Any more Fortran-like solution?

Another issue is, if the array A has different dimension in each index, say, n1 != n2 != n3 != n4, some permutation may out of bound. In this situation, the prefactor will be zero. For example, if n1 = n2 = 10, n3 = n4 = 20, it will be something like 0.1 * A(l1,l2,l3,l4) + 0.0 * A(l1,l3,l2,l4). In aother word, b = 0.1*a + 0.0*reshape(a, (/n1, n2, n3, n4/), order = (/1,3,2,4/) ) , or say 0.1*a + 0.0 * P(2,3) a, where P is a permutation operator. By checking the absolute value of permutation prefactor below some threshold, the summation would be able to skip that permutation.

In this case, the prefactor will be zero. The summation is supposed to skip that type of permutation.

Edited: a python reference implementation is below. I include a random and non-random version by the variable, gen_random. The latter may be eaiser to check.

import numpy as np
import time
import itertools as it


ref_list = [0, 1, 2, 3]
p = it.permutations(ref_list)
transpose_list = tuple(p)

n_loop = 2
na = nb = nc = nd = 30


A = np.zeros((na,nb,nc,nd))
gen_random = False
if gen_random == False:
    n = 1
    for la in range(na):
        for lb in range(nb):
            for lc in range(nc):
                for ld in range(nd):
                   A[la,lb,lc,ld] = n
                   n = n + 1          
else:
    A = np.random.random((na,nb,nc,nd))

factor_list = [(i+1)*0.1 for i in range(24)]
time_total = 0.0

for n in range(n_loop):
    sum_A = np.zeros((na,nb,nc,nd))
    start_0 = time.time()
    for m, t in enumerate(transpose_list):
       sum_A = np.add(sum_A, factor_list[m]  * np.transpose(A, transpose_list[m] ), out = sum_A) 
       #sum_A += factor_list[m]  * np.transpose(A, transpose_list[m]) 
    finish_0 = time.time()
    time_total += finish_0 - start_0


print('level 4', time_total/n_loop) 
print('Ref value', A[0,0,0,0], sum_A[0,0,0,0]) 

As a sanity check, if A[0,0,0,0] is non-zero, sum_A[0,0,0,0]/A[0,0,0,0] = 30, by 0.1 + 0.2 +... + 2.4 = (0.1+2.4)*2.4/2=30. Though the permutation factors can be different, the above is just an example.

Geositta
  • 81
  • 7
  • If you have a reference implementation in python it would really help if you can add it two the answer. I can see you want to add across permutations, but what I can't work out is exactly how you decided on which scaling factor to apply to which permutation. – Ian Bush May 03 '22 at 06:23
  • Thanks. I will add the Python implementation. – Geositta May 03 '22 at 06:27
  • Note I don't speak python so it probably won't help me :) It will provide a check for others that what they are doing is correct – Ian Bush May 03 '22 at 06:29
  • The python code has a flaw in the initialization loops, it always runs with `range(na)` instead of with `na`, `nb`, `nc` and `nd`. It doesn't make a difference here as they are all 30 but still .... – albert May 03 '22 at 09:09
  • gfortran will and can optimize the loops away, because you never use `b`. I suggest adding `print *, b(1,1,1,1)` after the outer loop. I'll also suggest use `call random_init(.true., .false.)` before the call to `random_number`. Edit: Forgot to mention use `integer(8)` for the arguments to `system_clock`. This should be much higher resolution. – steve May 03 '22 at 16:56
  • And another thing, `0.1` and `0.2` are single precision constants. You likely want `0.1d0` and `0.2d0`. – steve May 03 '22 at 17:01
  • Or better 0.1_dp and 0.2_dp – Ian Bush May 03 '22 at 17:20
  • Added `call random_init(.true., .false.)`, `integer(8)`, `0.1_dp`, `print *, b(1,1,1,1)`. Variable `b` was used inside the loop, though `b =0.0_dp` is not needed – Geositta May 03 '22 at 17:41
  • I maybe have an answer but can you clarify what happens when all the sizes of a are are are **not** the same - If that is the case permuting such indices can lead to an out of array bounds error. What do you want to happen in such a case? – Ian Bush May 03 '22 at 18:23
  • @IanBush Great point! In this case, the associate prefactor is zero. E.g., for a `3*3*2*2` array `0.1 * A(l1,l2,l3,l4) + 0.0 * A(l1,l3,l2,l4)`. So it will not involve in the summation. I made some other python code including this case, that the code checks if prefactor is nearly zero, then skip that permutation. I will add in the question as well. – Geositta May 03 '22 at 18:33
  • So if an out of bounds is possible is the prefactor zero for *all* of that case, or *only* those elements that would lead to an out of bounds access? – Ian Bush May 03 '22 at 18:36
  • @IanBush Great point! **All** that case, namely, for `3*3*2*2` array, it will look like `b = 0.1*a + 0.0*reshape(a, (/n1, n2, n3, n4/), order = (/1,3,2,4/) ) `, although not actually executing that code (may still out of bound); or `b = 0.1 * a + 0.0 * P[2,3] a`, `P` is a permutation operator. In another word, if a permutation violates the dimension, there is no element in that permutation will be involved in the summation. – Geositta May 03 '22 at 18:42
  • Perhaps I will add non-random part in the Fortran part of the question and provide reference values. – Geositta May 03 '22 at 22:16

1 Answers1

2

Here's one way that I think does it in Fortran, which also skips terms for which the prefactor is zero. I make no claims as to it being the best, there are many ways to do it. I also am hesitant to claim its correctness, what you have provided it makes it difficult to assess this fully. But it does pass the sanity test for the case when all the sizes are the same ... You give no way to check the more general case.

The main problem is Fortran provides no way to create the permutations you require, so I've written a little module which I believe implements the same ordering as python. This is ordering is taken from the python documentation and the algorithm to implement it from wikipedia. Unit testing strongly suggests it does the job.

Once you have that it is easy to loop over each permutation in turn skipping those that have zero weight, either because the prefactor is zero, or the shapes are incompatible. So with all the caveats above here's my effort along with the compiler version and a few tests, some with array bounds checking on, some without.

Note even if this is correct I certainly make not claims as to how optimal it is - the memory access pattern is very non-trivial, and optimising that will require much more thought than I am willing to give this now, though I suspect cache blocking will be required, as in the matrix transposition question that you reference.

Module permutations_module

  ! Little module to handle permutations of an arbitrary sized list of integer 1, 2, 3, .... n
  
  Implicit None

  Type, Public :: permutation
     Private
     Integer, Dimension( : ), Allocatable, Private :: state
   Contains
     Procedure, Public :: init
     Procedure, Public :: get
     Procedure, Public :: next
  End type permutation

  Private

Contains

  Subroutine init( p, n )

    ! Initalise a permutation

    Class( permutation ), Intent(   Out ) :: p
    Integer             , Intent( In    ) :: n

    Integer :: i
    
    Allocate( p%state( 1:n ) )

    p%state = [ ( i, i = 1, Size( p%state ) ) ]

  End Subroutine init

  Pure Function get( p ) Result( a )

    ! Get the current permutation

    Class( permutation ), Intent( In ) :: p

    Integer, Dimension( : ), Allocatable :: a

    a = p%state

  End Function get

  Function next( p ) Result( finished )

    ! Move onto the next permutation, returning .True. if there are no more permutations in the list
    
    Logical :: finished
  
    Class( permutation ), Intent( InOut ) :: p

    Integer :: k, l
    Integer :: tmp

    finished = .False.

    Do k = Size( p%state ) - 1, 1, -1
       If( p%state( k ) < p%state( k + 1 ) ) Exit
    End Do
    finished = k == 0

    If( .Not. finished ) Then
       
       Do l = Size( p%state ), k + 1, -1
          If( p%state( k ) < p%state( l ) ) Exit
       End Do

       tmp = p%state( k )
       p%state( k ) = p%state( l )
       p%state( l ) = tmp

       p%state( k + 1: ) = p%state( Size( p%state ):k + 1: - 1 )

    End If
    
  End Function next

End Module permutations_module

Program testit

  Use, Intrinsic :: iso_fortran_env, Only : wp => real64

  Use permutations_module, Only : permutation

  Implicit None

  Integer, Parameter :: n_iter = 100

  Type( permutation ) :: p

  Integer :: i
  
  Real( wp ), Dimension( :, :, :, : ), Allocatable :: a
  Real( wp ), Dimension( :, :, :, : ), Allocatable :: b
  
  Real( wp ), Dimension( 1:Product( [ ( i, i = 1, Size( Shape( a ) ) ) ] ) ) :: c

  Integer, Dimension( 1:Size( Shape( a ) ) ) :: this_permutation
  Integer, Dimension( 1:Size( Shape( a ) ) ) :: sizes
  Integer, Dimension( 1:Size( Shape( a ) ) ) :: permuted_sizes
  Integer, Dimension( 1:Size( Shape( a ) ) ) :: indices
  Integer, Dimension( 1:Size( Shape( a ) ) ) :: permuted_indices
  
  Integer :: n1, n2, n3, n4
  Integer :: l1, l2, l3, l4
  Integer :: iter
  Integer :: start, finish, rate
  
  Logical :: finished

  c = [ ( i * 0.1_wp, i = 1, Size( c ) ) ]

  Write( *, * ) 'n1, n2, n3, n4?'
  Read ( *, * ) n1, n2, n3, n4

  Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
  Allocate( b, Mold = a ) 

  Call Random_init( .true., .false. )
  Call Random_number( a )
  ! Make sure a( 1, 1, 1, 1 ) is not zero for the sanity check
  a( 1, 1, 1, 1 ) = a( 1, 1, 1, 1 ) + 0.1_wp

  Call system_clock( start, rate )
  sizes = Shape( a )
  b = 0.0_wp

  iter_loop: Do iter = 1, n_iter

     Call p%init( Size( Shape( a ) ) )
     i = 0
     finished = .False.
     permutation_loop: Do While( .Not. finished )
        i = i + 1

        ! Get the next permutation
        finished = p%next()

        ! Only do it if it has any weight
        If( Abs( c( i ) ) > Epsilon( c( i ) ) ) Then

           ! Get the current permutation
           this_permutation = p%get()

           ! Check the shapes are compatible
           permuted_sizes = sizes( this_permutation )

           If( All( permuted_sizes == sizes ) ) Then

              ! Add in the current permutation
              Do l4 = 1, n4
                 Do l3 = 1, n3     
                    Do l2 = 1, n2
                       Do l1 = 1, n1
                          indices = [ l1, l2, l3, l4 ]
                          permuted_indices = indices( this_permutation )
                          b( indices( 1 ), indices( 2 ), indices( 3 ), indices( 4 ) ) = &
                               b(indices( 1 ), indices( 2 ), indices( 3 ), indices( 4 )  ) + &
                               c( i ) * a( permuted_indices( 1 ), permuted_indices( 2 ), &
                               permuted_indices( 3 ), permuted_indices( 4 ) )
                       End Do
                    End Do
                 End Do
              End Do
              
           End If
              
        End If
     End Do permutation_loop
     
  End Do iter_loop
  
  Call system_clock( finish, rate )
  Write( *, * ) 'time per iteration = ', Real( finish - start ) / Real( rate ) / Real( n_iter )
  Write( *, * ) 'Sanity ', b( 1, 1, 1, 1 ) / a( 1, 1, 1, 1 ) / n_iter
     
End Program testit
ijb@ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ijb@ijb-Latitude-5410:~/work/stack$ gfortran  -fcheck=all -Wall -Wextra -O3 -g -std=f2018 perm_mod_single_thread.f90
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 n1, n2, n3, n4?
10 10 10 10
 time per iteration =    1.47000002E-03
 Sanity    30.000000000000259     
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 n1, n2, n3, n4?
11 10 9 12
 time per iteration =    0.00000000    
 Sanity    0.0000000000000000     
ijb@ijb-Latitude-5410:~/work/stack$ gfortran -Wall -Wextra -O3 -g -std=f2018 perm_mod_single_thread.f90
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 n1, n2, n3, n4?
30 30 30 30
 time per iteration =    6.56599998E-02
 Sanity    29.999999999999844     
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 n1, n2, n3, n4?
60 60 60 60
 time per iteration =    2.46800995    
 Sanity    30.000000000000036     
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 n1, n2, n3, n4?
10 20 30 40
 time per iteration =    2.00000013E-05
 Sanity    0.0000000000000000     
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
 n1, n2, n3, n4?
30 30 15 15
 time per iteration =    1.50999997E-03
 Sanity    1.4000000000000050     
ijb@ijb-Latitude-5410:~/work/stack$ 
Ian Bush
  • 6,996
  • 1
  • 21
  • 27
  • Thank you so much! I should have added non-random part in my Fortran code in the question and provided some reference values. Here I learned the square bracket can be used to define array since `Fortran 2003` and colon before do loop. Perhaps that's highly non trivial to optimize the performance, I got `time per iteration = 0.150659993` with `gfortran 9.3.0` with your flags `-Wall` etc. If I use `gfortran -O3 -ffast-math perm_mod_single_thread.f90`, I got `time per iteration = 8.57800022E-02`. The python performance is `level 4 0.113`, where 0.113.. is the time in second – Geositta May 03 '22 at 22:05
  • I tried `Julia` at some stage and I got `74.16 ms`. But going to higher dimensional array permutations, e.g., rank-5, `python` vs `julia` is `2.3` vs `3.4` sec, respectively. Maybe fortran can be faster. Maybe it is about memory catch things to really improve the performance than which program language is used. Thank you once again! There are many things I can learn! – Geositta May 03 '22 at 22:21
  • It doesn't really change the answer but I realised overnight I have been naughty and taken the `Shape` of an unallocated allocatable array - if I find time I will fix, doesn't affect the main loops. – Ian Bush May 04 '22 at 06:17