
I'm doing some benchmarking on a two-socket motherboard with two Intel Xeon Gold 6230s (40 cores total). The computer runs RHEL 7.6 and uses NUMA. My ultimate goal is to measure the performance difference of Intel's MKL library on an Intel vs. an AMD machine.

I installed Python 3.7.3 using Anaconda. Looking at numpy's shared library dependencies:

ldd /home/user/local/python/3.7/lib/python3.7/site-packages/numpy/linalg/lapack_lite.cpython-37m-x86_64-linux-gnu.so
    linux-vdso.so.1 =>  (0x00002aaaaaacc000)
    libmkl_rt.so => /home/user/local/python/3.7/lib/python3.7/site-packages/numpy/linalg/../../../../libmkl_rt.so (0x00002aaaaaccf000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaab3b6000)
    libc.so.6 => /lib64/libc.so.6 (0x00002aaaab5d2000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab995000)
    /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

You can see that it depends on libmkl_rt.so. Presumably the linear algebra routines like np.dot() go through this library. So I run the following code, numpy_test.py:

import numpy as np
import time

matrixSize = 5000     # outer dimension of the matrices
N = 50                # number of trials
np.random.seed(42)
Ax = matrixSize
Ay = 10000
Bx = 10000
By = matrixSize
A = np.random.rand(Ax, Ay)
B = np.random.rand(Bx, By)
npStartTime = time.time()
for i in range(N):
    AB = np.dot(A, B)
print("Run time : {:.4f} s".format(time.time() - npStartTime))

Running this with one core (wrong, see below) takes about 17.5 seconds. If I run 40 instances simultaneously, one per core, the average run time is 1200 s per process. This answer attempts to provide a solution for mitigating the problem, but two of its suggestions don't work at all and the third (dplace) doesn't seem to be easily accessible on RHEL 7.6.
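Since dplace isn't readily available here, one stdlib alternative for pinning each benchmark process to its own core is os.sched_setaffinity (Linux-only). This is just a sketch; the pinned_worker name and core counts are placeholders, and the actual np.dot() benchmark would go where the comment is:

```python
import os
import multiprocessing as mp

def pinned_worker(core):
    # Restrict this process to a single core (a rough stand-in for dplace)
    os.sched_setaffinity(0, {core})
    # ... run the np.dot() benchmark from numpy_test.py here ...
    return sorted(os.sched_getaffinity(0))

if __name__ == "__main__":
    n = min(4, os.cpu_count())  # would be 40 on the benchmark machine
    with mp.Pool(n) as pool:
        print(pool.map(pinned_worker, range(n)))
```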

Question

  1. Is it plausible that the huge performance hit when running 40 processes is due to all the processes competing for access to the shared library (presumably libmkl_rt.so) which only lives in one place in memory?

  2. If true, are there modern solutions to force each core to use its own copy of a shared library? I can't seem to find a static version of libmkl_rt.so to build numpy against.

EDIT

Following the suggestion of Gennady.F.Intel, I ran :

$ export MKL_VERBOSE=1; python3 src/numpy_attempt.py
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.10GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x555555c71cc0,1,0x555555c71cc0,1) 2.58ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,5000,5000,10000,0x7fffffffc870,0x2aaad834b040,5000,0x2aaac05d2040,10000,0x7fffffffc878,0x2aaaf00c4040,5000) 370.98ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
.
. 

So I think the contention for resources has more to do with the fact that each of my 40 instances asks for 40 threads, 1600 threads in total. If I export MKL_NUM_THREADS=1 and run my 40 instances of numpy_test.py, the average run time is ~440 seconds. Running a single instance of numpy_test.py on the machine takes 240 s. I think the discrepancy is explained, but the questions are still unanswered.
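For reference, MKL reads MKL_NUM_THREADS when the library is loaded, so setting it from inside the script only works if it happens before numpy is imported. A minimal sketch (the 100×100 sizes are just for illustration, not the benchmark sizes):

```python
import os
# MKL (and the OpenMP runtime) read these at load time,
# so they must be set before numpy is imported.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np

A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
C = A @ B
print(C.shape)  # (100, 100)
```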

  • Could you try to see what mkl will return by setting "export MKL_VERBOSE=1" environment variable? – Gennady.F Oct 22 '19 at 09:55
  • Great suggestion. See above edits. It solves the massive run time difference, but doesn't illuminate the answers to the questions. – irritable_phd_syndrome Oct 22 '19 at 13:11
  • yes, this is the typical thread-oversubscription problem, and MKL has only one solution: run the code with 1 thread only, but you already found that solution yourself. – Gennady.F Oct 23 '19 at 09:14

0 Answers