
I have a function which returns the squared residual norm of a large linear system of equations.

In [1]: import numpy as np                                                      

In [2]: A = np.random.rand(3600000, 200)                                        

In [3]: b = np.random.rand(3600000)                                             

In [4]: def f(x): 
   ...:     global A
   ...:     global b
   ...:     return np.linalg.norm(A.dot(x) - b)**2                                       

Now I have an algorithm in which this function has to be evaluated many times. However, due to the size of the system, each function call at a given `x` takes a lot of time.

In [5]: import time                                                             

In [6]: def f(x): 
   ...:     global A 
   ...:     global b 
   ...:     start = time.time() 
   ...:     res = np.linalg.norm(A.dot(x) - b)**2 
   ...:     end = time.time() 
   ...:     return res, end - start 

In [7]: test = np.random.rand(200)                                             

In [8]: f(test)                                                                
Out[8]: (8820030785.528395, 7.467242956161499)

My question is:

Are there any possibilities for reducing the time of such a function call?

I thought about replacing `np.linalg.norm(A.dot(x) - b)**2` with a more efficient expression, but I don't know what this could look like.


Technical information. The above code was written on a computer with

  • macOS Catalina version 10.15.5
  • 2.3 GHz Dual‑Core Intel Core i5 (Turbo Boost up to 3.6 GHz) with 64 MB eDRAM
  • 8 GB 2133 MHz LPDDR3 RAM (on-board); the system memory report shows:

      Memory Slots:

        ECC: Disabled
        Upgradeable Memory: No

        BANK 0/DIMM0:
          Size: 4 GB
          Type: LPDDR3
          Speed: 2133 MHz
          Status: OK (...)

        BANK 1/DIMM0:
          Size: 4 GB
          Type: LPDDR3
          Speed: 2133 MHz
          Status: OK (...)

The result of `np.show_config()` is

blas_mkl_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack']
    library_dirs = ['/Users/me/miniconda3/envs/magpy/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/me/miniconda3/envs/magpy/include']
blas_opt_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack', 'blas', 'cblas', 'lapack']
    library_dirs = ['/Users/me/miniconda3/envs/magpy/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/me/miniconda3/envs/magpy/include']
lapack_mkl_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack']
    library_dirs = ['/Users/me/miniconda3/envs/magpy/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/me/miniconda3/envs/magpy/include']
lapack_opt_info:
    libraries = ['blas', 'cblas', 'lapack', 'pthread', 'blas', 'cblas', 'lapack', 'blas', 'cblas', 'lapack']
    library_dirs = ['/Users/me/miniconda3/envs/magpy/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/me/miniconda3/envs/magpy/include']
  • Hey, just as a side note: since you're in IPython already, you might want to use magics like `%%time` or `%%timeit` for such measurements. – Jeronimo Aug 02 '20 at 08:29
  • I don't have a suggestion for how to reword the math expression or make the function faster, but pass `A` and `b` into the function through arguments instead of `global` statements (see the sketch after these comments). You can use default arguments if you really want to save the extra typing during the function call. – BatWannaBe Aug 02 '20 at 08:32
  • Working with a matrix of 720M elements is not free. However, `A.dot` is abnormally slow on your machine. Can you provide the following information in the question: your operating system (e.g. Windows, Linux, Mac), your processor model and the number of processors (e.g. 1 x Intel 9600KF), your memory size and speed as well as the number of DIMMs (e.g. 2 DIMMs of 8 GiB @ 3200 MT/s)? This information helps us to know the best possible time this operation should take and to check whether your function can be improved or not. – Jérôme Richard Aug 02 '20 at 08:46
  • @JérômeRichard I added this information. –  Aug 02 '20 at 09:28
  • @Jan Thank you. This is very useful. The problem is probably coming from the BLAS library used, as your hardware is able to do this operation much more quickly. Can you also provide information about the BLAS library used (see [here](https://stackoverflow.com/questions/9000164/how-to-check-blas-lapack-linkage-in-numpy-and-scipy) for how to do that)? The best would be to track the dynamic library files to know which BLAS implementation is used (e.g. on my system it is the package blas3). – Jérôme Richard Aug 02 '20 at 09:53
  • @JérômeRichard Thank you, I also thought about this point. Shall I add to the question what is shown when using `np.show_config()`? Otherwise, how do I track the dynamic library files, i.e., what does this mean? –  Aug 02 '20 at 10:06
  • @Jan Yes, it would be useful. For the tracking, I mean that you can look into the `library_dirs` directory and check the `libblas` files in it to know which implementation is used. – Jérôme Richard Aug 02 '20 at 11:18
  • @JérômeRichard I added the output. Additionally, I get the same time problem on my other Mac with a quad core, so maybe this is indeed a BLAS problem. –  Aug 02 '20 at 11:40
  • @Jan To be sure, can you check the result of `otool -L /Users/me/miniconda3/envs/magpy/lib/libblas.3.dylib` (I am not sure about the name nor the exact location of the dynamic library file)? – Jérôme Richard Aug 02 '20 at 15:47
  • @JérômeRichard Yes, I entered this on the command line and got `/Users/me/miniconda3/envs/magpy/lib/libblas.3.dylib: @rpath/libmkl_rt.dylib (compatibility version 0.0.0, current version 0.0.0) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1238.60.2)` –  Aug 02 '20 at 16:04
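
Following BatWannaBe's comment, a minimal sketch of binding `A` and `b` as default arguments instead of using `global` (this is mainly a style improvement; the `global` lookups are not what makes the call slow):

import numpy as np

A = np.random.rand(3600000, 200)
b = np.random.rand(3600000)

# A and b are bound once when the function is defined; no globals needed.
def f(x, A=A, b=b):
    return np.linalg.norm(A.dot(x) - b) ** 2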

2 Answers


The performance issue seems to come from the BLAS implementation being used.

The default BLAS implementation on your machine is apparently Intel MKL, which is usually quite fast, but is unexpectedly slow here. Indeed, based on the hardware information provided, the execution time should be about 170-200 ms, not 7.5 seconds.

You can try to switch to another BLAS implementation such as OpenBLAS, Apple Accelerate or BLIS. You can find information about how to do that here and here.
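
To confirm which BLAS is actually loaded at runtime, one option is the third-party `threadpoolctl` package (a minimal sketch; you would need to install `threadpoolctl` first):

import numpy as np
from threadpoolctl import threadpool_info

# Trigger a BLAS call so the library is loaded, then list the backends in use.
np.dot(np.ones((2, 2)), np.ones((2, 2)))
for info in threadpool_info():
    print(info.get("internal_api"), info.get("version"), info.get("num_threads"))

In a conda environment, the switch itself is usually done through the package manager rather than from Python, e.g. with something like `conda install "libblas=*=*openblas"` on conda-forge.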

If switching to another BLAS implementation does not fix the issue, you can use the following fallback Numba implementation:

import time
from numba import njit, prange

@njit(parallel=True)
def customMathOp(A, x, b):
    # Compute ||A@x - b||**2 row by row, without ever allocating the
    # intermediate vector A@x - b; the outer loop runs in parallel.
    squareNorm = 0.0
    for i in prange(A.shape[0]):
        s = 0.0
        for j in range(A.shape[1]):
            s += A[i, j] * x[j]
        squareNorm += (s - b[i]) ** 2
    return squareNorm

def f(x):
    global A
    global b
    start = time.time()
    res = customMathOp(A, x, b)
    end = time.time()
    return res, end - start

This code is not as good as using numpy functions backed by a fast BLAS implementation, but it should still be relatively fast (note that the first call to f will be a bit slow, as the compilation time is included).
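
A hypothetical warm-up call to keep that compilation cost out of later measurements (the tiny arrays here are arbitrary):

import numpy as np

# Compile customMathOp once on a tiny problem; subsequent calls are fast.
_ = customMathOp(np.ones((2, 2)), np.ones(2), np.ones(2))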

Note that using the type np.float32 for the arrays can speed up the execution by a factor of about 2, although the result will also be less accurate.
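
A minimal sketch of that `np.float32` variant, assuming `A` and `b` from the question (the one-time cast itself costs time and memory, so it only pays off if the function is called many times):

import numpy as np

# One-time cast; halves the memory traffic at the cost of single precision.
A32 = A.astype(np.float32)
b32 = b.astype(np.float32)

def f32(x):
    return np.linalg.norm(A32.dot(x.astype(np.float32)) - b32) ** 2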

Jérôme Richard
  • Thank you. My system seems to be strange. Using your example code, the first function evaluation takes approximately 7.5 seconds, while the next ones take between 14 and 21 seconds. –  Aug 02 '20 at 19:20
  • With `fastmath=True` (SIMD vectorization of the sum), the Numba version is actually a bit faster than the BLAS-numpy solution (224 ms → Numpy, 165 ms → Numba). – max9111 Aug 02 '20 at 19:30
  • Also with `fastmath=True` I am getting function call times between 5 and 21 seconds. What is wrong here? –  Aug 02 '20 at 19:40
  • @max9111 Good catch! The code can be faster than the BLAS since no big vectors have to be allocated and computed in RAM. – Jérôme Richard Aug 02 '20 at 20:38
  • @Jan This is very weird. Can you test the code sequentially (without `prange` and `parallel=True`)? Check also that your system is not in low-power mode. Finally, maybe you do not have enough RAM and the OS is using swap memory (stored on the hard drive). Indeed, the `A` matrix takes roughly 5.4 GiB and you have only 8 GiB. Can you try with smaller matrices? – Jérôme Richard Aug 02 '20 at 20:38

In your case `np.linalg.norm(v)` is just

np.sqrt(np.dot(v, v))

So you might be better off doing:

temp = np.dot(A,x) - b         # temp = A@x-b
return np.dot(temp, temp)      # return temp@temp

This skips the unnecessary sqrt/square round trip. But compared to the initial `A@x`, that is a minor issue.

On a rather plain 4 GB Linux computer your test case gives me (when creating `A`):

MemoryError: Unable to allocate 5.36 GiB for an array with shape (3600000, 200) and data type float64

While you apparently have enough memory, you may be pushing that boundary. In other SO questions we've seen that `dot`/`@` with very large arrays slows down due to memory-management issues. Often people gain speed by doing some sort of 'chunk' processing (see the sketch below). That's easy if you are doing a matmul with a 3d 'batch'; less obviously so with your norm case.
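
A minimal sketch of such chunked processing for this particular norm, assuming `A`, `x`, and `b` as in the question (the chunk size is an arbitrary tuning knob):

import numpy as np

def chunked_sq_norm(A, x, b, chunk=200_000):
    # Accumulate ||A@x - b||**2 over blocks of rows so the temporary
    # vector never exceeds `chunk` elements.
    total = 0.0
    for i in range(0, A.shape[0], chunk):
        t = A[i:i + chunk] @ x - b[i:i + chunk]
        total += t @ t
    return total

Whether this helps depends on how close you are to the memory limit; with plenty of free RAM it should perform about the same as the one-shot version.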

Cutting the size of `A` down by a factor of 10:

In [423]: A.shape                                                                                    
Out[423]: (360000, 200)
In [424]: temp = A@x-b; res = temp@temp                                                              
In [425]: res                                                                                        
Out[425]: 938613433.9717302
In [426]: np.linalg.norm(A.dot(x)-b)**2                                                              
Out[426]: 938613433.9717301

Not much different in times:

In [428]: timeit temp = A@x-b; res = temp@temp                                                       
85 ms ± 529 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [429]: timeit np.linalg.norm(A.dot(x)-b)**2                                                       
86.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact it's the `A.dot(x)` that dominates the timings; the rest is negligible.

Doubling the size of `A` roughly doubles the times (to the 175-180 ms range).

I'm not an expert on these libraries, but I believe MKL is one of the faster options, which I don't have (but you seem to).

hpaulj