To answer your question about threads: the most likely reason is that you have to set up the compilation process to use OpenMP. No compiler I know of enables OpenMP by default; you have to turn it on explicitly.
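For example (the exact spelling of the flag depends on your compiler, and my_code.f90 is just a placeholder name):

gfortran -fopenmp my_code.f90 -o my_code
ifort    -qopenmp my_code.f90 -o my_code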
However your real question is about speeding up a matrix transpose, and if I were doing this, multithreading is about the third thing I would look at. The first would be, as noted in the comments, can I rewrite my algorithm without explicitly forming the transpose? In my experience you rarely have to form the transpose explicitly. For instance many BLAS and LAPACK routines allow you to specify that you want to use the transposed form of the matrix, so you never actually have to take the transpose. Of course whether you can do this or not depends upon the exact details of what you are doing, but not taking the transpose at all is almost certainly the fastest way!
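As a sketch of what I mean: if you only need the transpose to feed a matrix multiplication, you can ask dgemm to use a**T directly instead of forming aT first (assuming n-by-n real64 matrices a, b and c, with wp being the real64 kind as in the code below):

! c = Transpose( a ) x b without ever forming Transpose( a ):
! the first 'T' argument tells dgemm to operate on the transpose of a
Call dgemm( 'T', 'N', n, n, n, 1.0_wp, a, n, b, n, 0.0_wp, c, n )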
I would next look at the single threaded performance. A matrix transpose is one of the worst things you can do to a modern memory subsystem, so throwing more cores at it is not likely to improve things much, as the limiting factor is the speed of the memory, not how fast you can do the computations. For small matrices, smaller than the cache, it's probably not an issue - the cache will save you. But for large matrices, bigger than the cache, you will run into problems. The way to address this is to break the operation into smaller blocks and transpose each block in turn, choosing the block size so that it is smaller than the cache. Below is code which does this in two slightly different ways, together with some single-threaded results on my laptop.
! transpose.f90
Program tran

  Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64

  Use omp_lib, Only : omp_get_max_threads

  Implicit None ( type, external )

  Real( wp ), Dimension( :, : ), Allocatable :: a, aT

  Real( wp ) :: t, t_in, t_in1, t_in2
  Real( wp ) :: error

  Integer :: start, finish, rate
  Integer :: bfac
  Integer :: n

  Write( *, * ) 'n ?'
  Read ( *, * ) n

  Allocate( a ( 1:n, 1:n ) )
  Allocate( aT( 1:n, 1:n ) )

  Call Random_number( a )

  Write( *, * ) 'Size of matrix ', n
  Write( *, * ) 'Using ', omp_get_max_threads(), ' threads'

  Call System_clock( start, rate )
  aT = Transpose( a )
  Call System_clock( finish, rate )
  error = Maxval( Abs( Transpose( a ) - aT ) )
  t_in1 = Real( finish - start, wp ) / rate
  Write( *, * ) 'Intrinsic 1: ', t_in1, ' Error: ', error

  Call System_clock( start, rate )
  aT = Transpose( a )
  Call System_clock( finish, rate )
  error = Maxval( Abs( Transpose( a ) - aT ) )
  t_in2 = Real( finish - start, wp ) / rate
  Write( *, * ) 'Intrinsic 2: ', t_in2, ' Error: ', error

  t_in = 0.5_wp * ( t_in1 + t_in2 )

  Write( *, * ) 'Read bound'
  bfac = 10
  Do While( bfac <= n )
    aT = -2.0_wp
    Call System_clock( start, rate )
    Call blocked_transpose_rb( a, bfac, aT )
    Call System_clock( finish, rate )
    error = Maxval( Abs( Transpose( a ) - aT ) )
    t = Real( finish - start, wp ) / rate
    Write( *, * ) bfac, ' Intrinsic: ', t, ' Error: ', error, ' Speed up ', t_in / t
    bfac = Nint( bfac * 1.5_wp )
  End Do

  Write( *, * ) 'Write bound'
  bfac = 10
  Do While( bfac <= n )
    aT = -2.0_wp
    Call System_clock( start, rate )
    Call blocked_transpose_wb( a, bfac, aT )
    Call System_clock( finish, rate )
    error = Maxval( Abs( Transpose( a ) - aT ) )
    t = Real( finish - start, wp ) / rate
    Write( *, * ) bfac, ' Intrinsic: ', t, ' Error: ', error, ' Speed up: ', t_in / t
    bfac = Nint( bfac * 1.5_wp )
  End Do

Contains

  ! Blocked transpose: the inner loop gives unit-stride writes to aT
  ! but strided reads from a, hence "read bound"
  Subroutine blocked_transpose_rb( a, bfac, aT )

    Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64

    Real( wp ), Dimension( :, : ), Intent( In  ) :: a
    Integer                      , Intent( In  ) :: bfac
    Real( wp ), Dimension( :, : ), Intent( Out ) :: aT

    Integer :: n
    Integer :: ib, jb
    Integer :: i , j

    n = Ubound( a, Dim = 1 )

    !$omp parallel default( none ) shared( n, bfac, a, AT ), private( ib, jb, i, j )
    !$omp do collapse( 2 )
    Do ib = 1, n, bfac
      Do jb = 1, n, bfac
        Do i = ib, Min( n, ib + bfac - 1 )
          Do j = jb, Min( n, jb + bfac - 1 )
            aT( j, i ) = a( i, j )
          End Do
        End Do
      End Do
    End Do
    !$omp end do
    !$omp end parallel

  End Subroutine blocked_transpose_rb

  ! Blocked transpose: the inner loop gives unit-stride reads from a
  ! but strided writes to aT, hence "write bound"
  Subroutine blocked_transpose_wb( a, bfac, aT )

    Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64

    Real( wp ), Dimension( :, : ), Intent( In  ) :: a
    Integer                      , Intent( In  ) :: bfac
    Real( wp ), Dimension( :, : ), Intent( Out ) :: aT

    Integer :: n
    Integer :: ib, jb
    Integer :: i , j

    n = Ubound( a, Dim = 1 )

    !$omp parallel default( none ) shared( n, bfac, a, AT ), private( ib, jb, i, j )
    !$omp do collapse( 2 )
    Do ib = 1, n, bfac
      Do jb = 1, n, bfac
        Do i = ib, Min( n, ib + bfac - 1 )
          Do j = jb, Min( n, jb + bfac - 1 )
            aT( i, j ) = a( j, i )
          End Do
        End Do
      End Do
    End Do
    !$omp end do
    !$omp end parallel

  End Subroutine blocked_transpose_wb

End Program tran
Compilation
ijb@ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@ijb-Latitude-5410:~/work/stack$ gfortran -O3 -fopenmp -Wall -Wextra -g transpose.f90 -o transpose
Code execution
ijb@ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=1; ./transpose < in
n ?
Size of matrix 20000
Using 1 threads
Intrinsic 1: 9.4619999999999997 Error: 0.0000000000000000
Intrinsic 2: 8.6950000000000003 Error: 0.0000000000000000
Read bound
10 Intrinsic: 2.2450000000000001 Error: 0.0000000000000000 Speed up 4.0438752783964365
15 Intrinsic: 2.0019999999999998 Error: 0.0000000000000000 Speed up 4.5347152847152854
23 Intrinsic: 1.8210000000000000 Error: 0.0000000000000000 Speed up 4.9854475562877543
35 Intrinsic: 1.4319999999999999 Error: 0.0000000000000000 Speed up 6.3397346368715084
53 Intrinsic: 1.2629999999999999 Error: 0.0000000000000000 Speed up 7.1880443388756934
80 Intrinsic: 1.2470000000000001 Error: 0.0000000000000000 Speed up 7.2802726543704885
120 Intrinsic: 1.1870000000000001 Error: 0.0000000000000000 Speed up 7.6482729570345409
180 Intrinsic: 1.1990000000000001 Error: 0.0000000000000000 Speed up 7.5717264386989154
270 Intrinsic: 1.1750000000000000 Error: 0.0000000000000000 Speed up 7.7263829787234037
405 Intrinsic: 1.1510000000000000 Error: 0.0000000000000000 Speed up 7.8874891398783662
608 Intrinsic: 1.1319999999999999 Error: 0.0000000000000000 Speed up 8.0198763250883403
912 Intrinsic: 1.2200000000000000 Error: 0.0000000000000000 Speed up 7.4413934426229513
1368 Intrinsic: 1.4050000000000000 Error: 0.0000000000000000 Speed up 6.4615658362989326
2052 Intrinsic: 3.2240000000000002 Error: 0.0000000000000000 Speed up 2.8159119106699748
3078 Intrinsic: 3.8510000000000000 Error: 0.0000000000000000 Speed up 2.3574396260711503
4617 Intrinsic: 4.0990000000000002 Error: 0.0000000000000000 Speed up 2.2148084898755793
6926 Intrinsic: 4.8529999999999998 Error: 0.0000000000000000 Speed up 1.8706985369874305
10389 Intrinsic: 5.5330000000000004 Error: 0.0000000000000000 Speed up 1.6407916139526477
15584 Intrinsic: 5.9450000000000003 Error: 0.0000000000000000 Speed up 1.5270815811606391
Write bound
10 Intrinsic: 1.5669999999999999 Error: 0.0000000000000000 Speed up: 5.7935545628589660
15 Intrinsic: 1.5389999999999999 Error: 0.0000000000000000 Speed up: 5.8989603638726447
23 Intrinsic: 1.4190000000000000 Error: 0.0000000000000000 Speed up: 6.3978153629316417
35 Intrinsic: 1.2110000000000001 Error: 0.0000000000000000 Speed up: 7.4966969446738227
53 Intrinsic: 1.3109999999999999 Error: 0.0000000000000000 Speed up: 6.9248665141113657
80 Intrinsic: 1.0700000000000001 Error: 0.0000000000000000 Speed up: 8.4845794392523359
120 Intrinsic: 1.0320000000000000 Error: 0.0000000000000000 Speed up: 8.7969961240310077
180 Intrinsic: 1.1960000000000000 Error: 0.0000000000000000 Speed up: 7.5907190635451505
270 Intrinsic: 1.2350000000000001 Error: 0.0000000000000000 Speed up: 7.3510121457489870
405 Intrinsic: 1.2480000000000000 Error: 0.0000000000000000 Speed up: 7.2744391025641022
608 Intrinsic: 1.2849999999999999 Error: 0.0000000000000000 Speed up: 7.0649805447470824
912 Intrinsic: 1.4750000000000001 Error: 0.0000000000000000 Speed up: 6.1549152542372880
1368 Intrinsic: 1.6970000000000001 Error: 0.0000000000000000 Speed up: 5.3497348261638180
2052 Intrinsic: 2.5249999999999999 Error: 0.0000000000000000 Speed up: 3.5954455445544555
3078 Intrinsic: 2.9569999999999999 Error: 0.0000000000000000 Speed up: 3.0701724721001016
4617 Intrinsic: 3.1490000000000000 Error: 0.0000000000000000 Speed up: 2.8829787234042552
6926 Intrinsic: 3.6120000000000001 Error: 0.0000000000000000 Speed up: 2.5134274640088594
10389 Intrinsic: 3.7269999999999999 Error: 0.0000000000000000 Speed up: 2.4358733565870674
15584 Intrinsic: 4.4550000000000001 Error: 0.0000000000000000 Speed up: 2.0378226711560044
The speed up quoted is how much faster the code ran for a given blocking factor compared to the Transpose() intrinsic, so a speed up of 2 means my code ran twice as fast as the intrinsic. You can see that just by restructuring the loops you can get an improvement of 8+ [yes, I know I'm being a little unfair to the intrinsic here, but the point remains]; that's equivalent to an awful lot of threads! Further, if you look at the read bound version, which should reflect cache use best, you can see the maximum speed up is at a blocking factor of 608. Given that each block transpose touches two blocks of size (blocking factor)*(blocking factor), at 8 bytes per real64 element, bfac=608 needs (608*608*8*2)/(1024*1024)=5.64 Mbytes, a little under the L3 cache size of my machine (8 Mbytes). Thus in this case it is more important to consider how you use your memory than how you use your cores.
Of course you can then multithread. Below are the results on my laptop, again in terms of speed up relative to the intrinsic. Moving to 2 threads gets you a little more, but not as much as the improvement above.
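No recompilation is needed to change the thread count; only the environment variable changes, for example:

export OMP_NUM_THREADS=2; ./transpose < in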
