
I have a driver .cpp file that calls the cblas_dgbmv function with proper arguments. When I build OpenBLAS with "make", dgbmv automatically runs with 8 threads (the multithreaded dgbmv is invoked in the gbmv.c interface, and I assume this is the default behaviour). Conversely, when I set OPENBLAS_NUM_THREADS=1 after this build, the sequential version runs and everything goes well. All good so far.

The problem is that I would like to assess the performance of the multithreaded cblas_dgbmv at different thread counts, using a loop that calls this function 1000 times serially and measuring the time. My driver itself is sequential. However, even 2-threaded dgbmv degrades performance (execution time), and it does so even for a single multithreaded call without the loop.

I researched multithreaded use of OpenBLAS and made sure everything conforms to the specifications. There is no thread spawning or any pragma directive in my driver (it runs solely a master thread, just to measure the wall-clock time). In other words, I call dgbmv in a sequential region so as not to conflict with OpenBLAS's threads. Still, it feels as if excessive threads are running and slowing execution down, although I have already set all the environment variables regarding thread counts, except OPENBLAS_NUM_THREADS, to 1.
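(Side note: besides the environment variable, I know the thread count can also be forced from the driver itself; OpenBLAS's cblas.h declares an openblas_set_num_threads() extension. A minimal sketch, assuming the OpenBLAS header is on the include path:)

    #include <cblas.h>  // OpenBLAS's cblas.h also declares this extension

    // Force the number of OpenBLAS threads from code, overriding
    // OPENBLAS_NUM_THREADS for all subsequent BLAS calls.
    void force_blas_threads(int n_threads) {
        openblas_set_num_threads(n_threads);
    }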

I use the OpenMP wall-clock timer and measure the execution time with code surrounding only this 1000-times caller loop, so that part is fine as well:

    double seconds, timing = 0.0;
    // for (int i = 0; i < 10000; i++) {
        seconds = omp_get_wtime();
        cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
        timing += omp_get_wtime() - seconds;
    // }
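
For reference, a minimal self-contained version of the driver looks roughly like this (the sizes, band widths, and fill values below are placeholders, not my actual arguments):

    #include <cblas.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Placeholder problem sizes; the real driver takes them from argv.
        const int n = 8192, kl = 4, ku = 4;
        const int lda = kl + ku + 1;           // column-major banded storage
        const double alpha = 1.0, beta = 0.0;
        std::vector<double> B(static_cast<size_t>(lda) * n, 1.0);  // band of the matrix
        std::vector<double> X(n, 1.0), Y(n, 0.0);

        double timing = 0.0;
        for (int i = 0; i < 1000; i++) {
            double seconds = omp_get_wtime();
            cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha,
                        B.data(), lda, X.data(), 1, beta, Y.data(), 1);
            timing += omp_get_wtime() - seconds;
        }
        std::printf("dgbmv x1000: %f s\n", timing);
        return 0;
    }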

I run my driver with the proper environment variable set at runtime (OPENBLAS_NUM_THREADS=4 ./myBinary args...). Here is my Makefile, which compiles both the library and the application:

myBinary: myBinary.cpp
    cd ./xianyi-OpenBLAS-0b678b1 && make USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=4  &&  make PREFIX=/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1  install
    g++ myBinary.cpp -o myBinary -I/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/include/ -L/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -Wl,-rpath,/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -lopenblas -fopenmp -lstdc++fs -std=c++17

Architecture: 64-core shared-memory machine with AMD Opteron processors.

I would be more than happy if anyone could explain what goes wrong with the multithreaded version of dgbmv.

  • Your architecture is a NUMA one (certainly with strong NUMA effects) and your operation is certainly memory-bound. Thus, there is a very high chance for the problem to come from *NUMA effects*. To confirm that, you can try to run your code on 1 socket alone and check the scalability. Platforms like yours are not easy to use efficiently. You can find interesting information on dealing with that in [this post](https://stackoverflow.com/questions/71340798/problem-of-sorting-openmp-threads-into-numa-nodes-by-experiment/71343253#71343253). – Jérôme Richard Jun 21 '22 at 17:16
  • By the way, you should care about the first touch of memory pages, or at least control the NUMA memory allocation policy with numactl. Here are other related posts about this: https://stackoverflow.com/questions/64409563#64415109 and https://stackoverflow.com/questions/62604334#62615032 – Jérôme Richard Jun 21 '22 at 17:20
  • @JérômeRichard Thank you for the answer. I really appreciate your insight about the hardware architecture, whose characteristics I was totally unaware of. I am currently running a parallel program with OpenMP and it scales well. However, with 2 threads, the htop command displays many total threads (e.g., 100 tasks, 500 threads), with only 2-4 of them actually executing on CPUs. How do they get created? They do not increase linearly either. Although this is weird, as I said, it scales well (as opposed to the problem with OpenBLAS). I'll try to digest the lower-level details of NUMA to resolve it ASAP. – selinnilesy Jun 23 '22 at 11:48
  • @JérômeRichard I am assuming those threads are somehow operating in the system, irrelevant to my app. But if there is a heavy influence of NUMA on the performance, why didn't it disrupt the scalability of my own OpenMP program as well? This is weird, as I said, as opposed to the problem with OpenBLAS. I am still not managing NUMA policies at all. – selinnilesy Jun 23 '22 at 13:21
  • @JérômeRichard Besides, how do I run on 1 socket to test the NUMA effect? The architecture has 4 sockets but 8 NUMA nodes with 8 processors in each node. I will try 1 node for now by setting OMP_PLACES to the first 8 CPUs; I hope that is meaningful (see the command sketch below). – selinnilesy Jun 23 '22 at 13:52
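
A minimal command sketch of the single-node test suggested in these comments, reusing the binary and arguments from the question and assuming NUMA node 0 as the target (the real node ids come from numactl --hardware):

    # Bind both the threads and the memory allocations to NUMA node 0,
    # then run the driver with 8 OpenBLAS threads on that node.
    OPENBLAS_NUM_THREADS=8 numactl --cpunodebind=0 --membind=0 ./myBinary args...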

2 Answers


In my own program that scales well (not the multithreaded OpenBLAS case mentioned above), I tried setting GOMP_CPU_AFFINITY to 0..8 and OMP_PROC_BIND to true, and also OMP_PLACES to threads(8), for the sake of running 8 threads on the first 8 CPUs (or cores) with no hyperthreading. I then visually checked via the htop utility that every thread was being executed on the first NUMA node with 8 processors. After ensuring that, the result was 5 seconds longer; by unsetting these variables, I got a result 5 seconds faster. @JérômeRichard I'll try the same thing for the OpenBLAS driver as well.


I have just tried what I described in the other comment (the settings for my own OpenMP program) with OpenBLAS. I built the library with make USE_OPENMP=1 (since, as I stated, my driver is sequential anyway) and NUM_THREADS=256 to set a maximum. After running OpenBLAS multithreaded, htop displays multiple threads running in the same NUMA node (e.g., on the first 8 cores), which I arranged using the environment variables OMP_PROC_BIND=true and OMP_PLACES=threads. However, even 1 call to multithreaded dgbmv is slower than the sequential (1-thread) version.
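
Concretely, the environment for that run was along these lines (8 threads here is just an example count):

    export OMP_PROC_BIND=true   # keep OpenMP threads pinned to their places
    export OMP_PLACES=threads   # one place per hardware thread
    OPENBLAS_NUM_THREADS=8 ./myBinary args...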

Besides, on my system the multithreaded OpenBLAS threads keep alternating between sleeping and running (whereas in my own OpenMP parallel program all threads are always in the running state), and their CPU utilization is low, around 60%.

[screenshot of htop]