I have a C++ driver file that calls cblas_dgbmv with proper arguments. When I build OpenBLAS with a plain "make", dgbmv automatically runs with 8 threads (the multithreaded dgbmv is invoked from the gbmv.c interface, and I assume this is the default behaviour). In contrast, when I set OPENBLAS_NUM_THREADS=1 after this build, the sequential version runs and everything works as expected. So far, so good.
The problem is that I would like to assess the performance of the multithreaded cblas_dgbmv for different thread counts, using a loop that calls the function 1000 times in sequence and measuring the elapsed time. My driver is sequential. However, even with 2 threads, dgbmv is slower (in execution time) than the sequential version, and this already holds for a single multithreaded call, without the loop.
I read up on multithreaded use of OpenBLAS and made sure my setup conforms to the documentation. There is no thread spawning and there are no pragma directives in my driver (it runs only a master thread, just to measure wall-clock time). In other words, I call DGBMV in a sequential region, so as not to conflict with OpenBLAS's own threads. Nevertheless, it feels as if too many threads are running and slowing the execution down, even though I have already set every thread-count environment variable other than OPENBLAS_NUM_THREADS to 1.
I use the OpenMP wall-clock timer and measure the execution time with code that surrounds only this 1000-iteration loop, so that part should be fine as well:
double seconds,timing=0.0;
//for(int i=0; i<10000; i++){
seconds = omp_get_wtime ( );
cblas_dgbmv(CblasColMajor, CblasNoTrans , n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
timing += omp_get_wtime ( ) - seconds;
// }
I run my driver with the appropriate environment variable set at runtime (OPENBLAS_NUM_THREADS=4 ./myBinary args...). Here is my Makefile, which builds both the library and the application:
myBinary: myBinary.cpp
	cd ./xianyi-OpenBLAS-0b678b1 && make USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=4 && make PREFIX=/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1 install
	g++ myBinary.cpp -o myBinary -I/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/include/ -L/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -Wl,-rpath,/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -lopenblas -fopenmp -lstdc++fs -std=c++17
Architecture: 64-core shared-memory machine with AMD Opteron processors
I would be more than happy if anyone could explain what goes wrong with the multithreaded version of dgbmv.