
For the following code, extended from OpenMP with BLAS:

  Program bench_dgemm

    Use, Intrinsic :: iso_fortran_env, Only : li => int64
    Use            :: omp_lib

    Implicit None

    Integer, Parameter :: dp = selected_real_kind( 15, 307 )

    Real( dp ), Dimension( :, :    ), Allocatable :: a
    Real( dp ), Dimension( :, :, : ), Allocatable :: b
    Real( dp ), Dimension( :, :, : ), Allocatable :: c

    Integer :: na, nb, nc, nd, m, m_iter
    Integer( li ) :: start, finish, rate
    Integer :: numthreads
    Integer :: ithr, istart, iend
    Real( dp ) :: sum_time

    Write( *, * ) 'numthreads'
    Read( *, * ) numthreads

    Call omp_set_num_threads( numthreads )

    Write( *, * ) 'na, nb, nc, nd ?'
    Read( *, * ) na, nb, nc, nd
    Allocate( a( 1:na, 1:nb ) )
    Allocate( b( 1:nb, 1:nc, 1:nd ) )
    Allocate( c( 1:na, 1:nc, 1:nd ) )

    ! A[a,b] * B[b,c,d] = C[a,c,d]
    Call Random_number( a )
    Call Random_number( b )
    c = 0.0_dp

    m_iter = 30

    Write( *, * ) 'm_iter average', m_iter
    Write( *, * ) 'numthreads', numthreads

    sum_time = 0.0_dp
    Do m = 1, m_iter
      Call System_clock( start, rate )

      ! Split the outermost (d) index of b and c into one contiguous chunk
      ! per thread; the integer division makes the chunks exactly cover 1:nd.
      !$omp parallel private(ithr, istart, iend)
      ithr = omp_get_thread_num()

      istart = ithr * nd / numthreads
      iend   = ( ithr + 1 ) * nd / numthreads

      ! Each thread multiplies a by its slab b(:, :, istart+1:iend), viewed as
      ! an nb x (nc*(iend-istart)) matrix, writing the matching slab of c.
      Call dgemm( 'N', 'N', na, nc * ( iend - istart ), nb, 1.0_dp, a, na, &
                  b( 1, 1, 1 + istart ), Size( b, Dim = 1 ),               &
                  0.0_dp, c( 1, 1, 1 + istart ), Size( c, Dim = 1 ) )
      !$omp end parallel

      Call System_clock( finish, rate )
      sum_time = sum_time + Real( finish - start, dp ) / rate
    End Do

    Write( *, * ) 'Time for dgemm', sum_time / m_iter

  End Program bench_dgemm

Suppose the file is called bench.f90. I compiled with `ifort bench.f90 -o bench -qopenmp -mkl=sequential` and then ran `bench`.

For na=nb=nc=nd=200, varying numthreads gives me (the first column is numthreads):

1 Time for dgemm  4.053670000000001E-002
2 Time for dgemm  2.087716666666666E-002
4 Time for dgemm  1.082136666666667E-002
8 Time for dgemm  5.819133333333333E-003
16 Time for dgemm  4.304533333333333E-003
32 Time for dgemm  5.269366666666666E-003

With `gfortran bench.f90 -o bench -fopenmp -lopenblas` I got:

1 Time for dgemm  0.13534268956666665
2 Time for dgemm   6.9672616866666662E-002
4 Time for dgemm   3.5927094433333334E-002
8 Time for dgemm   1.8583297666666668E-002
16 Time for dgemm   1.1969903900000000E-002
32 Time for dgemm   1.9136184166666667E-002

It seems OpenMP gets less speedup at 32 threads (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 sockets, hence 40 cores). The split is over the outermost index of the matrix: similar to splitting c in A[a,b]*B[b,c], the code splits the d index of B[b,c,d] into several segments, so it should be straightforward to parallelize. Why, then, does the performance not get much faster up to ~32 cores? (If the d dimension of B[b,c,d] were only 30, I could imagine that 32 cores would not help.)
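To make the split concrete (this snippet is not part of the original program; it is a minimal serial sketch using the same arrays and dimensions as bench.f90): with one thread, istart = 0 and iend = nd, so the parallel region collapses to a single product of a with the contiguous array b viewed as an nb x (nc*nd) matrix, i.e. the whole contraction in one call:

      ! Serial reference: the same contraction as the parallel region above,
      ! done in a single dgemm by treating b(1:nb, 1:nc, 1:nd) as an
      ! nb x (nc*nd) matrix (valid because b is contiguous in memory).
      Call dgemm( 'N', 'N', na, nc * nd, nb, 1.0_dp, a, na, &
                  b, Size( b, Dim = 1 ),                    &
                  0.0_dp, c, Size( c, Dim = 1 ) )

The threaded version simply cuts this one product into numthreads independent slabs along d, so in principle the pieces need no communication at all.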

Would MPI give better performance than OpenMP, closer to the ideal scaling?

AlphaF20
  • What kind of hardware do you have? What exact CPU models? Are they real full cores? How many FPUs? Any hyperthreading? Some overhead is normal in parallel computing. – Vladimir F Героям слава Jan 05 '22 at 07:36
  • I suggest moving the number of threads to a variable. Do not call a function so many times. Maybe it is optimized to a variable by the compiler, but you can never be sure. – Vladimir F Героям слава Jan 05 '22 at 07:40
  • Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz. Thanks, I will adjust the number of threads entries. – AlphaF20 Jan 05 '22 at 07:46
  • Is it one CPU socket or two sockets? This CPU has 20 cores and 40 virtual cores with hyperthreading. – Vladimir F Героям слава Jan 05 '22 at 07:51
  • Thanks. `lscpu` tells me `Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 40 On-line CPU(s) list: 0-39 Thread(s) per core: 1 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 ` I guess 2 sockets. I will keep checking. – AlphaF20 Jan 05 '22 at 07:53
  • `Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz Stepping: 4 CPU MHz: 2400.000 BogoMIPS: 4800.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 28160K NUMA node0 CPU(s): 0-19 NUMA node1 CPU(s): 20-39 ` – AlphaF20 Jan 05 '22 at 07:56
  • For a small matrix multiplication like 200x200 this looks like what I would expect - in fact pretty good. Does it get better if you make the matrices bigger, e.g. 1000x1000? – Ian Bush Jan 05 '22 at 08:11
  • Thanks. I will try. But, theoretically, why can it not be parallelized ideally, since I thought operations over the outer index are independent, namely in `A[a,b] B[b,c,d]` each `c` does not need to communicate with any other `c` (?) – AlphaF20 Jan 05 '22 at 08:12
  • Disregard my comments about matrix multiplication. But hitting some overhead in OpenMP is normal. – Vladimir F Героям слава Jan 05 '22 at 08:12
  • For `na nb nc nd 200 200 1000 200`, it (`ifort` + `mkl`) gets better: `16 Time for dgemm 1.895113333333333E-002`; `32 Time for dgemm 1.484706666666667E-002`. Curious, what is the reason for this `openmp` overhead? If the dimension of `c` were only 30, I could imagine that 32 cores would not help. – AlphaF20 Jan 05 '22 at 08:20
  • 2
    My guess is that the memory subsytem can't keep all the cores fed all the time - this is usual, even if you ran 32 totally independent programs I wouldn't expect perfect speed up, in fact far from it, as they will al contend for the same memory bus. Other possible reasons include thread placement, thread migration, the local neutron flux and the phase of the moon - a modern computer is a very complex device, and often there just aren't simple explanations for the details of the observed performance. What you've got is fine. – Ian Bush Jan 05 '22 at 08:39
  • 1
    For dual socket designs it's crucial that you set `OMP_PROC_BIND=true`. Did you? – Victor Eijkhout Jan 05 '22 at 10:50
  • Thanks a lot! I did not. After `export OMP_PROC_BIND=true`, most times it helps. E.g., 16 cores `4.204800000000000E-003`, 32 cores `3.335600000000001E-003`. – AlphaF20 Jan 05 '22 at 20:02

1 Answer


We tried the shared sample code at our end and saw the results below. Try setting `OMP_PROC_BIND=true` (`setenv OMP_PROC_BIND true` in csh, or `export OMP_PROC_BIND=true` in bash); it should help in your case.
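For example, in bash (the `OMP_PLACES=cores` line is an optional extra pinning hint, not something mentioned in the question or comments):

  export OMP_PROC_BIND=true
  export OMP_PLACES=cores
  ./bench

In csh the equivalent of the first line is `setenv OMP_PROC_BIND true`.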

  numthreads
  1
  na, nb, nc, nd ?
  200 200 200 200
  m_iter average 30
  numthreads 1
  MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 sequential
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.23ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.50ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.66ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.68ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.64ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.63ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.67ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.71ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.74ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.68ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.65ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.71ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.68ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.67ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 116.28ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 143.58ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.96ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.98ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.06ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.99ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.12ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.06ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.01ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.93ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.08ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.07ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.09ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.10ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.03ms CNR:OFF Dyn:1 FastMM:1
  MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.05ms CNR:OFF Dyn:1 FastMM:1
  Time for dgemm 0.116057933333333

  • Thank you so much. Actually `OMP_PROC_BIND=true` was mentioned in Victor Eijkhout's comment. I think I only need `export` in a `bash` environment; `setenv` is for `csh`: https://unix.stackexchange.com/questions/368944/what-is-the-difference-between-env-setenv-export-and-when-to-use – AlphaF20 Jan 20 '22 at 17:06