
related question

I tried to extend the code in the answer at the above link to include cross checks and OpenMP.

Program reshape_for_blas

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64

  Implicit None

  Real( wp ), Dimension( :, :    ), Allocatable :: a
  Real( wp ), Dimension( :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
  Real( wp ), Dimension( :, :    ), Allocatable :: d
  Real( wp ), Dimension( :, :    ), Allocatable :: e

  Integer :: na, nb, nc, nd, ne
  Integer :: la, lb, lc, ld  
  Integer( li ) :: start, finish, rate, numthreads

  numthreads = 2

  call omp_set_num_threads(numthreads)  
  
  
  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  ne = nc * nd
  Allocate( a ( 1:na, 1:nb ) ) 
  Allocate( b ( 1:nb, 1:nc, 1:nd ) ) 
  Allocate( c1( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c2( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c3( 1:na, 1:nc, 1:nd ) )
  Allocate( c4( 1:na, 1:nc, 1:nd ) )
  Allocate( c5( 1:na, 1:nc, 1:nd ) )  
  Allocate( d ( 1:nb, 1:ne ) ) 
  Allocate( e ( 1:na, 1:ne ) ) 

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )

  ! With reshapes
  Call System_clock( start, rate )
  
  !write (*,*) 'clock', start, rate
  
  
  d = Reshape( b, Shape( d ) )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a, Size( a, Dim = 1 ), &
                                            d, Size( d, Dim = 1 ), &
                                    0.0_wp, e, Size( e, Dim = 1 ) )
  c1 = Reshape( e, Shape( c1 ) )
  Call System_clock( finish, rate )
  
  !write (*,*) 'clock', finish, rate

  Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate
  Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )
  
  
  ! Direct
  Call System_clock( start, rate )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c2, Size( c2, Dim = 1 ) )
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for straight  method ', Real( finish - start, wp ) / rate

  Call System_clock( start, rate )
  
  !$omp parallel
  ! Direct
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c4, Size( c4, Dim = 1 ) )
  !$omp end parallel
  Call System_clock( finish, rate )  
  Write( *, * ) 'Time for straight  method omp', Real( finish - start, wp ) / rate

  !naive
  Call System_clock( start, rate )

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c3(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop', Real( finish - start, wp ) / rate

  !naive omp
  Call System_clock( start, rate )  
  !$omp parallel

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c5(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  !$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)    
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c5(la,lc,ld) = c5(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
  !$omp end do  
  !$omp end parallel
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop omp', Real( finish - start, wp ) / rate

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         
         if ( dabs(c3(la,lc,ld) - c1(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c1', c3(la,lc,ld) - c1(la,lc,ld)
         endif         
         
         
         if ( dabs(c3(la,lc,ld) - c2(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c2', c3(la,lc,ld) - c2(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c4(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c4', la,lc,ld, c3(la,lc,ld) - c4(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c5(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
         endif         
      enddo
    enddo
  enddo  
  
End Program reshape_for_blas

I got two issues:

  1. There is no significant OpenMP speedup for either the BLAS call or the naive loops. E.g., compiling with gfortran -std=f2008 -Wall -Wextra -fcheck=all reshape.f90 -lblas -fopenmp and input 30 30 30 60, I got
30 30 30 60
 Time for reshaping method    2.9443999999999998E-003
 Difference between result matrices    12.380937791257775
 Time for straight  method    1.0016000000000001E-003
 Time for straight  method omp   2.4878000000000001E-003
 Time for loop   6.6072500000000006E-002
 Time for loop omp  0.100242600000000002
  2. When the dimensions get larger, e.g., 60 60 60 60 in the input, the OpenMP BLAS result can differ from the naive loop; it seems I am missing some control option.

What would be the problems with OpenMP here?

Edit: I added OpenMP lines to the initialization in the c5 section and commented out two printing lines,


Program reshape_for_blas

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64

  Implicit None

  Real( wp ), Dimension( :, :    ), Allocatable :: a
  Real( wp ), Dimension( :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
  Real( wp ), Dimension( :, :    ), Allocatable :: d
  Real( wp ), Dimension( :, :    ), Allocatable :: e

  Integer :: na, nb, nc, nd, ne
  Integer :: la, lb, lc, ld  
  Integer( li ) :: start, finish, rate, numthreads

  numthreads = 2

  call omp_set_num_threads(numthreads)  
  
  
  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  ne = nc * nd
  Allocate( a ( 1:na, 1:nb ) ) 
  Allocate( b ( 1:nb, 1:nc, 1:nd ) ) 
  Allocate( c1( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c2( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c3( 1:na, 1:nc, 1:nd ) )
  Allocate( c4( 1:na, 1:nc, 1:nd ) )
  Allocate( c5( 1:na, 1:nc, 1:nd ) )  
  Allocate( d ( 1:nb, 1:ne ) ) 
  Allocate( e ( 1:na, 1:ne ) ) 

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )

  ! With reshapes
  Call System_clock( start, rate )
  
  !write (*,*) 'clock', start, rate
  
  
  d = Reshape( b, Shape( d ) )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a, Size( a, Dim = 1 ), &
                                            d, Size( d, Dim = 1 ), &
                                    0.0_wp, e, Size( e, Dim = 1 ) )
  c1 = Reshape( e, Shape( c1 ) )
  Call System_clock( finish, rate )
  
  !write (*,*) 'clock', finish, rate

  Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate
  Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )
  
  
  ! Direct
  Call System_clock( start, rate )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c2, Size( c2, Dim = 1 ) )
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for straight  method ', Real( finish - start, wp ) / rate

  !naive loop
  Call System_clock( start, rate )

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c3(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop', Real( finish - start, wp ) / rate

  !dgemm omp 
  Call System_clock( start, rate )
  
  !$omp parallel
  ! Direct
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c4, Size( c4, Dim = 1 ) )
  !$omp end parallel
  Call System_clock( finish, rate )  
  Write( *, * ) 'Time for straight  method omp', Real( finish - start, wp ) / rate

  !loop omp
  Call System_clock( start, rate )  
  !$omp parallel

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c5(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  !$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)    
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c5(la,lc,ld) = c5(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
  !$omp end do  
  !$omp end parallel
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop omp', Real( finish - start, wp ) / rate

!single core: c1 c2 c3
! c1 reshape blas
! c2 blas
! c3 naive loop (reference)
! parallel: c4 c5
! c4 dgemm parallel
! c5 naive loop parallel


  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         
         if ( dabs(c3(la,lc,ld) - c1(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c1', c3(la,lc,ld) - c1(la,lc,ld)
         endif         
         
         
         if ( dabs(c3(la,lc,ld) - c2(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c2', c3(la,lc,ld) - c2(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c4(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c4', la,lc,ld, c3(la,lc,ld) - c4(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c5(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
         endif         
      enddo
    enddo
  enddo  
  
End Program reshape_for_blas

then gfortran reshape.f90 -lblas -fopenmp with input 30 30 30 30 leads to

 Time for reshaping method    1.3519000000000001E-003
 Difference between result matrices    12.380937791257775
 Time for straight  method    6.2549999999999997E-004
 Time for straight  method omp   1.2600000000000001E-003
 Time for naive loop   1.0008599999999999E-002
 Time for naive loop omp   1.6678999999999999E-002

Not a good speedup, though.

  • Your OpenMP code seems significantly broken. The first case (where you call Dgemm inside an omp parallel region) is simply doing all the work in every thread (and is also hugely racy because each thread is updating all of the shared data). Your second OMP case also has races (in the initialisation, where each thread writes to the whole of the array). – Jim Cownie Feb 22 '21 at 08:45
  • @JimCownie Thank you so much! For the second OMP case, I added omp lines in the initialisation lines, still not good speed up (will edit the question soon) – AlphaF20 Feb 22 '21 at 17:02
  • The computation of `c4` is simply wrong. You are calling the same multiplication with the same arguments in a parallel region. It makes all threads compute the same matrix several times but race in writing it to the storage. That it produces something akin to a correct result with smaller matrices is a pure coincidence and implementation specific. – Hristo Iliev Feb 22 '21 at 19:00
  • I also think it is wrong, but I don't know how to write it correctly :( I tried ```gfortran -lopenblas -fopenmp reshape.f90```, still an issue with ```c4```. I checked a few tutorials on OpenMP with Fortran, but did not find anything about BLAS. – AlphaF20 Feb 22 '21 at 20:01

1 Answer


You are calling DGEMM in parallel using the same set of variables (because variables in parallel regions are shared by default in Fortran). This doesn't work and produces weird results due to data races. You have two options:

  • Find a parallel BLAS implementation where DGEMM is already threaded. Intel MKL and OpenBLAS are prime candidates. Intel MKL uses OpenMP, more specifically it is built with Intel OpenMP runtime, so it may not play very nicely with OpenMP code compiled with GCC, but it works perfectly with non-threaded code.

  • Call DGEMM in parallel but not with the same set of arguments. Instead, perform block decomposition of one or both tensors and have each thread do the contraction for a separate block. Since Fortran uses column-major storage, it may be appropriate to decompose the second tensor:

    C[i,k,l=1..L] = A[i,j] * B[j,k,l=1..L]
    

    becomes with two threads:

    thread 0: C[i,k,l=1..L/2] = A[i,j] * B[j,k,l=1..L/2]
    thread 1: C[i,k,l=L/2+1..L] = A[i,j] * B[j,k,l=L/2+1..L]
    

    With an arbitrary number of threads it boils down to computing the start and end values for the l index in each thread and adjusting the arguments of DGEMM accordingly.

Personally, I'd go with a parallel BLAS implementation. With Intel MKL, you only need to link in the parallel driver and it will automagically use all the available CPUs.
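
For example (the exact link line depends on your compiler and on how the BLAS library was built, so treat these as illustrative): with the Intel compiler and MKL something like ifort reshape.f90 -mkl=parallel pulls in the threaded driver, while with a threaded OpenBLAS build gfortran reshape.f90 -lopenblas -fopenmp should be enough. The number of BLAS threads is then typically controlled by the library itself, through environment variables such as MKL_NUM_THREADS or OPENBLAS_NUM_THREADS (falling back to OMP_NUM_THREADS).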

A sample implementation of the block decomposition follows. Only additions and changes to your original code are shown:

  ! ADD: Use the OpenMP module
  Use :: omp_lib

  ! ADD: Variables used for the decomposition
  Integer :: ithr, istart, iend

  ! CHANGE: OpenMP with block decomposition
  !$omp parallel private(ithr, istart, iend)
    ithr = omp_get_thread_num()

    ! First index (plane) in B for the current thread
    istart = ithr * nd / omp_get_num_threads()
    ! First index (plane) in B for the next thread
    iend = (ithr + 1) * nd / omp_get_num_threads()

    Call dgemm('N', 'N', na, nc * (iend - istart), nb, 1.0_wp, a, Size(a, Dim = 1), &
               b(1, 1, 1 + istart), Size(b, Dim = 1), &
               0.0_wp, c4(1, 1, 1 + istart), Size(c4, Dim = 1))
  !$omp end parallel

istart is the zero-based index of the first plane of B that the current thread works on. iend is the first plane for the next thread, so iend - istart is the number of planes for the current thread. b(1, 1, 1 + istart) is where the block of planes in B starts. c4(1, 1, 1 + istart) is where the corresponding block in the result tensor starts.
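
For example, with nd = 60 and two threads, thread 0 gets istart = 0 and iend = 30 (planes 1 to 30 of B and of the result), while thread 1 gets istart = 30 and iend = 60 (planes 31 to 60). The trick works because consecutive planes of B (and of c4) are contiguous in memory, so a block of iend - istart planes can be handed to DGEMM as a single matrix with nc * (iend - istart) columns, which is exactly the fourth argument above.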

Make sure you do one of those but not both simultaneously. I.e., if your BLAS implementation is threaded but you decide to go with block decomposition, disable threading in the BLAS library. Conversely, if you use threading in the BLAS implementation, do not perform block decomposition in your code.
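
Restricting the BLAS library to one thread is implementation specific: with OpenBLAS you would typically set the environment variable OPENBLAS_NUM_THREADS=1 (or use its openblas_set_num_threads extension), with MKL set MKL_NUM_THREADS=1 (or call mkl_set_num_threads); check the documentation of your particular library and build.

As an aside, the naive-loop OpenMP version in the question also deserves attention: the initialisation loop is executed (and raced on) by every thread, and the reduction(+: c5) gives each thread a private copy of the whole array even though each la iteration of the worksharing loop writes a distinct slice of c5. A minimal sketch of one way to restructure it, assuming the declarations from the question (not tuned for cache access):

  ! Parallelise over the outermost (slowest-varying) index; each thread owns
  ! whole ld planes of c5, so no reduction and no separate initialisation pass
  ! are needed.
  !$omp parallel do private(la, lb, lc)
  do ld = 1, nd
    do lc = 1, nc
      do la = 1, na
        c5(la, lc, ld) = 0.0_wp
        do lb = 1, nb
          c5(la, lc, ld) = c5(la, lc, ld) + a(la, lb) * b(lb, lc, ld)
        end do
      end do
    end do
  end do
  !$omp end parallel do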

Hristo Iliev
  • Thank you so much! About the first option, I tried ```gfortran reshape.f90 -openblas -fopenmp``` and got ```undefined reference to `dgemm_'```; about MKL, I tried to link gfortran to it by ```${MKLPATH}/libmkl_blas95_ilp64.a ${MKLPATH}/libmkl_intel_ilp64.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a -liomp5 -lpthread -lm -ldl -fopenmp``` in the ```Makefile```; I could compile it, but got inconsistent results in sections ```c1```, ```c2```, and ```c4```, sorry for the stupid reply – AlphaF20 Feb 22 '21 at 17:34
  • You are computing the difference between `c1` and `c2` after you've computed `c1` but before you've computed `c2`. – Hristo Iliev Feb 22 '21 at 17:40
  • Oops, I am so sorry. I commented out ```Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate ! Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )```, the output is e.g., ```!!! c1 6.0495335176173199 !!! c2 6.0495335176173199 !!! c4 1 20 9 6.0495335176173199``` – AlphaF20 Feb 22 '21 at 17:54
  • You have three different methods involving `DGEMM` produce the same "error". That should tell you something about the root of the problem. Perhaps it's `c3` that has the wrong value or you call `DGEMM` with the wrong arguments. – Hristo Iliev Feb 22 '21 at 19:03
  • ```c3``` is computed by the nested loop without OpenMP, so I think it is ok. I tried ```gfortran -lopenblas -fopenmp reshape.f90``` on another computer; only ```c4``` is different, so perhaps something is wrong with ```openblas``` on my computer. I think ```c4``` is wrong, but I don't know how to write it correctly :(. – AlphaF20 Feb 22 '21 at 20:03
  • I told you how to write it correctly - perform block decomposition. Now there is sample Fortran code in the answer, not only math. – Hristo Iliev Feb 23 '21 at 08:09
  • Thank you very much for your beautiful code. I am so sorry for many stupid questions :( 1) How do I disable threading in the BLAS library? 2) In ```c4(1, 1, 1 + istart)``` the indices are specified, while the code in the question only uses ```c4```, yet that calculation went through all indices; why? 3) About the parallel BLAS approach, would it just be ```ifort reshape.f90 -qopenmp -mkl``` or ```gfortran reshape.f90 -lopenblas -fopenmp```? I tried again, but not much speedup – AlphaF20 Feb 23 '21 at 19:21
  • 1) Read the documentation of the BLAS library. [This here](https://github.com/xianyi/OpenBLAS/wiki/Faq#multi-threaded) is a good start. 2) Each thread specifies a different starting location in `b` and `c4` but it also specifies a block size that is less than the full dimension (the 4th argument of `DGEMM` is `nc * (iend - istart)`). 3) Read the documentation of the BLAS library. With `ifort` and MKL you only need `ifort reshape.f90 -mkl=parallel`. – Hristo Iliev Feb 23 '21 at 19:34
  • I'd be careful calling BLAS from within a parallel section: you can oversubscribe your system (more threads than cores) as each library is oblivious to the other's choice of number of threads (a typical problem when mixing OpenMP with OpenBLAS from repos, for example). BLAS libraries have some mechanism for restricting the number of threads, or you should use the same OpenMP runtime in both your code and the BLAS lib. – ipapadop Mar 02 '21 at 20:03