
I tried to use OpenBLAS for a matrix multiplication, relying on its built-in parallelism rather than distributing DGEMM calls across threads myself. The efficiency is poor when the common dimension is large. Is that a limitation of OpenBLAS, meaning I should either write a parallel dgemm manually or switch to MKL?

Here is the Fortran code:

Program test

    Use, Intrinsic :: iso_fortran_env, Only : li => int64
    Implicit None

    Integer, Parameter :: dp = selected_real_kind(15, 307)

    Real( dp ), Dimension( :, : ), Allocatable :: a
    Real( dp ), Dimension( :, : ), Allocatable :: b
    Real( dp ), Dimension( :, : ), Allocatable :: c

    Integer :: na, nb, nc, m, m_iter
    Integer :: numthreads
    Integer( li ) :: start, finish, rate
    Real( dp ) :: sum_time1, alpha, beta

  
    Write( *, * ) 'number of threads'
    Read( *, * ) numthreads

    Write( *, * ) 'na, nb, nc?'
    Read( *, * ) na, nb, nc
    Allocate( a ( 1:na, 1:nb ) ) 
    Allocate( b ( 1:nb, 1:nc ) )  
    Allocate( c( 1:na, 1:nc ) )

  
    Call Random_number( a )
    Call Random_number( b )
    c  = 0.0_dp
 
    alpha = 1.0_dp
    beta = 0.0_dp
    m_iter = 10
  
    write (*,*) 'm_iter average', m_iter 
    !call omp_set_num_threads(1)
    call openblas_set_num_threads(1)
    sum_time1 = 0.0_dp
    do m = 1, m_iter     
      Call System_clock( start, rate )
      Call dgemm( 'N', 'N', na, nc, nb, alpha, a, na, b, nb, beta, c, na )   ! ldc must be na (leading dimension of c), not nc
      Call System_clock( finish, rate )
      sum_time1 = sum_time1 + Real( finish - start, dp ) / rate  
    end do  
    Write( *, * ) 'Time for dgemm-1', sum_time1 / m_iter
      

   ! call omp_set_num_threads(numthreads)  
  !  call mkl_set_num_threads(numthreads)
    call openblas_set_num_threads(numthreads)
    sum_time1 = 0.0_dp
    do m = 1, m_iter     
      Call System_clock( start, rate )
      Call dgemm( 'N', 'N', na, nc, nb, alpha, a, na, b, nb, beta, c, na )   ! ldc must be na (leading dimension of c), not nc
      Call System_clock( finish, rate )
      sum_time1 = sum_time1 + Real( finish - start, dp ) / rate  
    end do  
    Write( *, * ) 'Time for dgemm-auto parallel', sum_time1 / m_iter
 
    Deallocate( a, b, c )

End Program test

If I compile with `gfortran-11 -O3 -fopenmp test.f90 -lopenblas` (assuming the file is called test.f90), I get


 number of threads
2
 na, nb, nc?
6 100000 6
 m_iter average          10
 Time for dgemm-1  0.15203639000000002
 Time for dgemm-auto parallel  0.20141369999999997

The parallel run is slower than the single-threaded one. For 1000*1000 matrices, however, multithreading does give a speedup.

(edited: the thread count is now set explicitly for the first dgemm)
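For the "write parallel dgemm manually" option mentioned above: since na and nc are tiny (6) while nb is huge, the natural decomposition is to split the common dimension nb into chunks, give each chunk its own partial 6x6 accumulator, and sum the partials at the end. The sketch below (not from the original post; plain sequential Python, with a small nb for illustration) shows only that decomposition and reduction; in a real implementation each chunk would run on its own thread (e.g. via OpenMP in Fortran).

```python
# Sketch: the chunked decomposition a manually parallelized dgemm could use
# for C(6x6) = A(6 x nb) * B(nb x 6).  Each chunk of the common dimension
# produces a partial 6x6 product; the partials are reduced at the end.
M = N = 6
nb = 1000          # small for illustration; the question uses nb = 100000
nchunks = 4        # in a threaded version, one chunk per thread

A = [[1.0] * nb for _ in range(M)]      # M x nb, simple test data
B = [[2.0] * N for _ in range(nb)]      # nb x N

def partial_gemm(k0, k1):
    """Partial product over indices k0..k1-1 of the common dimension."""
    c = [[0.0] * N for _ in range(M)]
    for k in range(k0, k1):
        for i in range(M):
            aik = A[i][k]
            for j in range(N):
                c[i][j] += aik * B[k][j]
    return c

# each call below is the work one thread would do on its own chunk
parts = [partial_gemm(t * nb // nchunks, (t + 1) * nb // nchunks)
         for t in range(nchunks)]

# reduction step: sum the partial 6x6 results into the final C
C = [[sum(p[i][j] for p in parts) for j in range(N)] for i in range(M)]
print(C[0][0])   # 1.0 * 2.0 accumulated nb times = 2000.0
```

Because the partial accumulators are independent, the chunk loop parallelizes with no synchronization except the final reduction of nchunks small 6x6 matrices.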

user235862

  • A matrix multiplication requires na*nb*nc FMA operations. In your case that is 3.6e6 operations. To multiply two 1000*1000 matrices, 1e9 FMA operations are needed. The workload in the 6*100000*6 case is maybe too low to benefit from multithreading. Try with nb=1e7 elements – PierU Jun 14 '23 at 08:04
  • Thanks. number of threads 2 na, nb, nc? 6 10000000 6 m_iter average 10 Time for dgemm-1 0.15666251000000003 Time for dgemm-auto parallel 0.20545938000000002 – user235862 Jun 14 '23 at 12:42
  • How do you know how many threads `dgemm` uses in the dgemm-1 case? It looks like you are assuming it uses a single thread, but are you sure about that? I would insert `call omp_set_num_threads(1)` for the dgemm-1 test. – PierU Jun 14 '23 at 15:50
  • Does [this](https://stackoverflow.com/questions/72663092/getting-numpy-linalg-svd-and-numpy-matrix-multiplication-to-use-multithreadng) answer answer your question? It seems very similar to me. In your pathological case, writing your own dgemm can worth the effort, especially since na and nc are very small (even more if you know they are constant and will never be bigger). – Jérôme Richard Jun 14 '23 at 21:28
  • @PierU edited, still the same. @JérômeRichard thanks. That is related. Perhaps `openblas` is not good at "thin" matrices – user235862 Jun 14 '23 at 22:31
  • Sorry, I meant `call openblas_set_num_threads(1)` – PierU Jun 15 '23 at 07:37
  • @JérômeRichard from your linked answer, it looks like OpenBLAS would not use multithreading here because the output matrix is only 6x6. But then the two timings in the above test should be the same. – PierU Jun 15 '23 at 07:54
  • @PierU modified to `call openblas_set_num_threads(1)`, same thing. – user235862 Jun 15 '23 at 20:22
  • The fact that `openblas_set_num_threads(1)` results in the same timings tends to indicate that OpenBLAS was using 1 thread as pointed out in the answer, although IDK why the 2 timings are different... Another way to check that is simply to track the CPU usage of the target machine (eg. using `htop` on Linux). – Jérôme Richard Jun 16 '23 at 18:24
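The workload comparison made in the first comment (na*nb*nc fused multiply-adds per dgemm call) can be sanity-checked in a couple of lines; this is a quick check, not part of the original thread:

```python
# Operation counts for the two shapes discussed above.
thin = 6 * 100_000 * 6        # the 6 x 100000 x 6 case from the question
square = 1000 * 1000 * 1000   # two 1000 x 1000 matrices
print(thin)                   # 3600000 -- the 3.6e6 from the comment
print(square // thin)         # 277 -- each square multiply is ~280x more work
```

So each 1000x1000 multiply carries roughly 280 times the arithmetic of the thin case, which is why the square case can amortize threading overhead while the thin case may not.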
