To answer your question about threads: the most likely reason is that you have to set up the compilation process to use OpenMP. No compiler I know of enables OpenMP by default; you have to turn it on explicitly.
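For example (the exact spelling of the flag depends on your compiler, and my_code.f90 is just a placeholder name):

gfortran -fopenmp my_code.f90 -o my_code
ifort    -qopenmp my_code.f90 -o my_code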
However your real question is about speeding up a matrix transpose, and if I were doing this, multithreading is about the third thing I would look at. The first would be, as noted in the comments, can I rewrite my algorithm without explicitly forming the transpose? In my experience you rarely have to form the transpose explicitly. For instance many BLAS and LAPACK routines allow you to specify that you want to use the transposed form of the matrix, so you never actually have to take the transpose. Of course whether you can do this or not depends upon the exact details of what you are doing, but not taking the transpose at all is almost certainly the fastest way!
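As a sketch of what I mean: if you only need the transpose to feed a matrix multiplication, you can ask dgemm to use a**T directly instead of forming aT first (assuming n-by-n real64 matrices a, b and c, with wp being the real64 kind as in the code below):

! c = Transpose( a ) x b without ever forming Transpose( a ):
! the first 'T' argument tells dgemm to operate on the transpose of a
Call dgemm( 'T', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )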
I would next look at the single threaded performance. A matrix transpose is one of the worst things you can do to a modern memory subsystem, so throwing more cores at it is not likely to improve things much, as the limiting factor is the speed of the memory, not how fast you can do the computations. For small matrices, smaller than the cache, it's probably not an issue - the cache will save you. But for large matrices, bigger than the cache, you will run into problems. The way to address this is to break the operation into smaller blocks and transpose each block in turn, choosing the block size so that it is smaller than the cache. Below is code which does this in two slightly different ways, together with some single-threaded results on my laptop.
! transpose.f90
Program tran

  Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64

  Use omp_lib, Only : omp_get_max_threads

  Implicit None ( type, external )

  Real( wp ), Dimension( :, : ), Allocatable :: a, aT

  Real( wp ) :: t, t_in, t_in1, t_in2
  Real( wp ) :: error

  Integer :: start, finish, rate
  Integer :: bfac
  Integer :: n

  Write( *, * ) 'n ?'
  Read ( *, * ) n

  Allocate( a ( 1:n, 1:n ) )
  Allocate( aT( 1:n, 1:n ) )

  Call Random_number( a )

  Write( *, * ) 'Size of matrix ', n
  Write( *, * ) 'Using ', omp_get_max_threads(), ' threads'

  Call System_clock( start, rate )
  aT = Transpose( a )
  Call System_clock( finish, rate )
  error = Maxval( Abs( Transpose( a ) - aT ) )
  t_in1 = Real( finish - start, wp ) / rate
  Write( *, * ) 'Intrinsic 1: ', t_in1, ' Error: ', error

  Call System_clock( start, rate )
  aT = Transpose( a )
  Call System_clock( finish, rate )
  error = Maxval( Abs( Transpose( a ) - aT ) )
  t_in2 = Real( finish - start, wp ) / rate
  Write( *, * ) 'Intrinsic 2: ', t_in2, ' Error: ', error

  t_in = 0.5_wp * ( t_in1 + t_in2 )

  Write( *, * ) 'Read bound'
  bfac = 10
  Do While( bfac <= n )
    aT = -2.0_wp
    Call System_clock( start, rate )
    Call blocked_transpose_rb( a, bfac, aT )
    Call System_clock( finish, rate )
    error = Maxval( Abs( Transpose( a ) - aT ) )
    t = Real( finish - start, wp ) / rate
    Write( *, * ) bfac, ' Intrinsic: ', t, ' Error: ', error, ' Speed up ', t_in / t
    bfac = Nint( bfac * 1.5_wp )
  End Do

  Write( *, * ) 'Write bound'
  bfac = 10
  Do While( bfac <= n )
    aT = -2.0_wp
    Call System_clock( start, rate )
    Call blocked_transpose_wb( a, bfac, aT )
    Call System_clock( finish, rate )
    error = Maxval( Abs( Transpose( a ) - aT ) )
    t = Real( finish - start, wp ) / rate
    Write( *, * ) bfac, ' Intrinsic: ', t, ' Error: ', error, ' Speed up: ', t_in / t
    bfac = Nint( bfac * 1.5_wp )
  End Do

Contains

  ! Blocked transpose: the inner loop gives unit-stride writes to aT
  ! but strided reads from a, hence "read bound"
  Subroutine blocked_transpose_rb( a, bfac, aT )

    Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64

    Real( wp ), Dimension( :, : ), Intent( In  ) :: a
    Integer                      , Intent( In  ) :: bfac
    Real( wp ), Dimension( :, : ), Intent( Out ) :: aT

    Integer :: n
    Integer :: ib, jb
    Integer :: i , j

    n = Ubound( a, Dim = 1 )

    !$omp parallel default( none ) shared( n, bfac, a, AT ), private( ib, jb, i, j )
    !$omp do collapse( 2 )
    Do ib = 1, n, bfac
      Do jb = 1, n, bfac
        Do i = ib, Min( n, ib + bfac - 1 )
          Do j = jb, Min( n, jb + bfac - 1 )
            aT( j, i ) = a( i, j )
          End Do
        End Do
      End Do
    End Do
    !$omp end do
    !$omp end parallel

  End Subroutine blocked_transpose_rb

  ! Blocked transpose: the inner loop gives unit-stride reads from a
  ! but strided writes to aT, hence "write bound"
  Subroutine blocked_transpose_wb( a, bfac, aT )

    Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64

    Real( wp ), Dimension( :, : ), Intent( In  ) :: a
    Integer                      , Intent( In  ) :: bfac
    Real( wp ), Dimension( :, : ), Intent( Out ) :: aT

    Integer :: n
    Integer :: ib, jb
    Integer :: i , j

    n = Ubound( a, Dim = 1 )

    !$omp parallel default( none ) shared( n, bfac, a, AT ), private( ib, jb, i, j )
    !$omp do collapse( 2 )
    Do ib = 1, n, bfac
      Do jb = 1, n, bfac
        Do i = ib, Min( n, ib + bfac - 1 )
          Do j = jb, Min( n, jb + bfac - 1 )
            aT( i, j ) = a( j, i )
          End Do
        End Do
      End Do
    End Do
    !$omp end do
    !$omp end parallel

  End Subroutine blocked_transpose_wb

End Program tran
Compilation
ijb@ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@ijb-Latitude-5410:~/work/stack$ gfortran -O3 -fopenmp -Wall -Wextra -g transpose.f90 -o transpose
Code execution
ijb@ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=1; ./transpose < in
n ?
Size of matrix 20000
Using 1 threads
Intrinsic 1: 9.4619999999999997 Error: 0.0000000000000000
Intrinsic 2: 8.6950000000000003 Error: 0.0000000000000000
Read bound
10 Intrinsic: 2.2450000000000001 Error: 0.0000000000000000 Speed up 4.0438752783964365
15 Intrinsic: 2.0019999999999998 Error: 0.0000000000000000 Speed up 4.5347152847152854
23 Intrinsic: 1.8210000000000000 Error: 0.0000000000000000 Speed up 4.9854475562877543
35 Intrinsic: 1.4319999999999999 Error: 0.0000000000000000 Speed up 6.3397346368715084
53 Intrinsic: 1.2629999999999999 Error: 0.0000000000000000 Speed up 7.1880443388756934
80 Intrinsic: 1.2470000000000001 Error: 0.0000000000000000 Speed up 7.2802726543704885
120 Intrinsic: 1.1870000000000001 Error: 0.0000000000000000 Speed up 7.6482729570345409
180 Intrinsic: 1.1990000000000001 Error: 0.0000000000000000 Speed up 7.5717264386989154
270 Intrinsic: 1.1750000000000000 Error: 0.0000000000000000 Speed up 7.7263829787234037
405 Intrinsic: 1.1510000000000000 Error: 0.0000000000000000 Speed up 7.8874891398783662
608 Intrinsic: 1.1319999999999999 Error: 0.0000000000000000 Speed up 8.0198763250883403
912 Intrinsic: 1.2200000000000000 Error: 0.0000000000000000 Speed up 7.4413934426229513
1368 Intrinsic: 1.4050000000000000 Error: 0.0000000000000000 Speed up 6.4615658362989326
2052 Intrinsic: 3.2240000000000002 Error: 0.0000000000000000 Speed up 2.8159119106699748
3078 Intrinsic: 3.8510000000000000 Error: 0.0000000000000000 Speed up 2.3574396260711503
4617 Intrinsic: 4.0990000000000002 Error: 0.0000000000000000 Speed up 2.2148084898755793
6926 Intrinsic: 4.8529999999999998 Error: 0.0000000000000000 Speed up 1.8706985369874305
10389 Intrinsic: 5.5330000000000004 Error: 0.0000000000000000 Speed up 1.6407916139526477
15584 Intrinsic: 5.9450000000000003 Error: 0.0000000000000000 Speed up 1.5270815811606391
Write bound
10 Intrinsic: 1.5669999999999999 Error: 0.0000000000000000 Speed up: 5.7935545628589660
15 Intrinsic: 1.5389999999999999 Error: 0.0000000000000000 Speed up: 5.8989603638726447
23 Intrinsic: 1.4190000000000000 Error: 0.0000000000000000 Speed up: 6.3978153629316417
35 Intrinsic: 1.2110000000000001 Error: 0.0000000000000000 Speed up: 7.4966969446738227
53 Intrinsic: 1.3109999999999999 Error: 0.0000000000000000 Speed up: 6.9248665141113657
80 Intrinsic: 1.0700000000000001 Error: 0.0000000000000000 Speed up: 8.4845794392523359
120 Intrinsic: 1.0320000000000000 Error: 0.0000000000000000 Speed up: 8.7969961240310077
180 Intrinsic: 1.1960000000000000 Error: 0.0000000000000000 Speed up: 7.5907190635451505
270 Intrinsic: 1.2350000000000001 Error: 0.0000000000000000 Speed up: 7.3510121457489870
405 Intrinsic: 1.2480000000000000 Error: 0.0000000000000000 Speed up: 7.2744391025641022
608 Intrinsic: 1.2849999999999999 Error: 0.0000000000000000 Speed up: 7.0649805447470824
912 Intrinsic: 1.4750000000000001 Error: 0.0000000000000000 Speed up: 6.1549152542372880
1368 Intrinsic: 1.6970000000000001 Error: 0.0000000000000000 Speed up: 5.3497348261638180
2052 Intrinsic: 2.5249999999999999 Error: 0.0000000000000000 Speed up: 3.5954455445544555
3078 Intrinsic: 2.9569999999999999 Error: 0.0000000000000000 Speed up: 3.0701724721001016
4617 Intrinsic: 3.1490000000000000 Error: 0.0000000000000000 Speed up: 2.8829787234042552
6926 Intrinsic: 3.6120000000000001 Error: 0.0000000000000000 Speed up: 2.5134274640088594
10389 Intrinsic: 3.7269999999999999 Error: 0.0000000000000000 Speed up: 2.4358733565870674
15584 Intrinsic: 4.4550000000000001 Error: 0.0000000000000000 Speed up: 2.0378226711560044
The speed up quoted is how much faster the code ran for a given blocking factor compared to the Transpose() intrinsic, so a speed up of 2 means my code ran twice as fast as the intrinsic. You can see that just by restructuring the loops you can get an improvement of 8+ [yes, I know I'm being a little unfair to the intrinsic here, but the point remains]; that's equivalent to an awful lot of threads! Further, if you look at the read bound version, which should reflect cache use best, you can see the maximum speed up is at a blocking factor of 608. Given that each block transpose touches two blocks of size (blocking factor)*(blocking factor), at 8 bytes per real64 element, bfac=608 needs (608*608*8*2)/(1024*1024)=5.64 Mbytes, a little under the L3 cache size of my machine (8 Mbytes). Thus in this case it is more important to consider how you use your memory than how you use your cores.
Of course you can then multithread. Below are the results on my laptop, again in terms of speed up relative to the intrinsic. Moving to 2 threads gets you a little more, but not as much as the improvement above.
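No recompilation is needed to change the thread count; only the environment variable changes, for example:

export OMP_NUM_THREADS=2; ./transpose < in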
