I tried to write the python code in Optimizing array additions and multiplications with transposes
in Fortran to see if I can achieve any speed up (-O3
helps a lot; the approach in Ian Bush's answer in Transposition of a matrix by multithread in Fortran, seems too complicated to me). E.g.,
0.1 * A(l1,l2,l3,l4) + 0.2*A(l1,l2,l4,l3) + 0.3 * A(l1,l3,l2,l4)+...
If I tried to extend from
Program transpose
integer, parameter :: dp = selected_real_kind(15, 307)
real(dp), dimension(:, :, :, :), allocatable :: a, b
Integer :: n1, n2, n3, n4, n, m_iter
Integer :: l1, l2, l3, l4
Integer(8) :: start, finish, rate
real(dp) :: sum_time
Write(*, *) 'n1, n2, n3, n4?'
Read(*, *) n1, n2, n3, n4
Allocate( a ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
Allocate( b ( 1:n1, 1:n2, 1:n3, 1:n4 ) )
call random_init(.true., .false.)
Call Random_number( a )
m_iter = 100
b = 0.0_dp
Call system_clock( start, rate )
do n = 1, m_iter
do l4 = 1, n4
do l3 = 1, n3
do l2 = 1, n2
do l1 = 1, n1
b(l1,l2,l3,l4) = 0.1_dp*a(l1,l2,l3,l4) + 0.2_dp*a(l1,l2,l4,l3)
end do
end do
end do
end do
end do
Call system_clock( finish, rate )
sum_time = real( finish - start, dp ) / rate
write (*,*) 'all loop', sum_time/m_iter
print *, b(1,1,1,1)
End
(I tried reshape
, slower than nested loops)
Is there any simple way to include A(l1,l3,l2,l4)
, A(l1,l3,l4,l2)
etc? I can use Python
to generate a strings to include all of them with \
for changing lines.
A potential complexity is, if there is a term 0.0 * A(l4,l3,l2,l1), and I would like to skip it, generate a string from python is complicated. Any more Fortran-like solution?
Another issue is, if the array A
has different dimension in each index, say, n1 != n2 != n3 != n4
, some permutation may out of bound. In this situation, the prefactor will be zero. For example, if n1 = n2 = 10
, n3 = n4 = 20
, it will be something like 0.1 * A(l1,l2,l3,l4) + 0.0 * A(l1,l3,l2,l4)
. In aother word, b = 0.1*a + 0.0*reshape(a, (/n1, n2, n3, n4/), order = (/1,3,2,4/) )
, or say 0.1*a + 0.0 * P(2,3) a
, where P
is a permutation operator. By checking the absolute value of permutation prefactor below some threshold, the summation would be able to skip that permutation.
In this case, the prefactor will be zero. The summation is supposed to skip that type of permutation.
Edited: a python reference implementation is below. I include a random and non-random version by the variable, gen_random
. The latter may be eaiser to check.
import numpy as np
import time
import itertools as it
ref_list = [0, 1, 2, 3]
p = it.permutations(ref_list)
transpose_list = tuple(p)
n_loop = 2
na = nb = nc = nd = 30
A = np.zeros((na,nb,nc,nd))
gen_random = False
if gen_random == False:
n = 1
for la in range(na):
for lb in range(nb):
for lc in range(nc):
for ld in range(nd):
A[la,lb,lc,ld] = n
n = n + 1
else:
A = np.random.random((na,nb,nc,nd))
factor_list = [(i+1)*0.1 for i in range(24)]
time_total = 0.0
for n in range(n_loop):
sum_A = np.zeros((na,nb,nc,nd))
start_0 = time.time()
for m, t in enumerate(transpose_list):
sum_A = np.add(sum_A, factor_list[m] * np.transpose(A, transpose_list[m] ), out = sum_A)
#sum_A += factor_list[m] * np.transpose(A, transpose_list[m])
finish_0 = time.time()
time_total += finish_0 - start_0
print('level 4', time_total/n_loop)
print('Ref value', A[0,0,0,0], sum_A[0,0,0,0])
As a sanity check, if A[0,0,0,0]
is non-zero, sum_A[0,0,0,0]/A[0,0,0,0] = 30
, by 0.1 + 0.2 +... + 2.4 = (0.1+2.4)*2.4/2=30
. Though the permutation factors can be different, the above is just an example.