
I have a large coordinate grid (vectors a and b), and for each grid point I generate and solve a 10x10 matrix equation. Is there a way for scipy.linalg.solve to accept vectorized input? So far my solution has been to run for loops over the coordinate arrays.

import time
import numpy as np
import scipy.linalg

N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N))      # input matrix, not static

def f(a,b,n):     # array-filling function
   return a*b*n

def sol(x,y):   # matrix solver
   D = np.arange(0,N)
   B = f(x,y,D)**2 + f(x-1, y+1, D)      # source vector
   X = scipy.linalg.solve(A,B)
   return X    # output an N-size vector

start = time.time()

answer = np.zeros(shape=(a.size, b.size)) # predefine output array

for egg in range(a.size):             # an ugly double-for cycle
   for ham in range(b.size):
       aa = a[egg]
       bb = b[ham]
       answer[egg,ham] = sol(aa,bb)[0]

print(time.time() - start)
yevgeniy
  • 1) define your `A`, `B` arrays without a loop (see numpy tutorials on how to do that). 2) Assemble the `A` arrays for different coordinates into one block-diagonal matrix (see [`scipy.linalg.block_diag`](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.linalg.block_diag.html)), and similarly concatenate the B vectors. Finally, call `linalg.solve` once on the result. – rth Jul 22 '15 at 15:50
  • Thanks! 1) Yes, sorry for the obscurity, the real function is more complicated. Removed the inner loop. 2) That would make A a (1000·1000·10) × (1000·1000·10) matrix. I don't think it is reasonable to invert such a matrix, even a block-diagonal one. – yevgeniy Jul 22 '15 at 17:05
  • Thanks again! Vectorizing over only one axis makes a lot of sense to me now. A quick check yielded a ~30x speedup compared to the loop method. To add to your answer, there is a `scipy.sparse.linalg.spsolve` routine which, unlike `linalg.solve`, is optimized for sparse matrices. – yevgeniy Jul 22 '15 at 17:48
  • @yevgeniy you could post the solution you found as an answer... – Saullo G. P. Castro Jul 23 '15 at 00:08
  • **Edit:** I mixed up two numbers, @rth's method makes things ~30 times **slower**, actually. Thinking about it, it makes sense: the computational complexity of a sparse solve is around O(N^(3/2)). Thus it is faster to run a double for loop over 1000^2 small dense matrices, 1000^2·10^3 = **10^9** operations, than to work with one large matrix, (1000·1000·10)^(3/2) = **10^10.5**. – yevgeniy Jul 23 '15 at 13:24
  • There is a lot wrong with this code; first of all, it won't run as-is. Secondly, you can get a large speedup by vectorizing all the loops; linalg.inv is a generalized ufunc, so it will invert a whole nd-array of matrices for you in a single call, should you want it to. But you don't appear to want to do any such thing; A is static, and the heavy lifting of its inversion can be precomputed once, leaving you with a single vectorized linear product to solve your entire problem. But the bigger question is: what problem are you trying to solve in the first place? – Eelco Hoogendoorn Jul 23 '15 at 14:26

3 Answers


To illustrate my point about generalized ufuncs and the ability to eliminate the loops over `egg` and `ham`, consider the following piece of code:

import numpy as np
A = np.random.randn(4,4,10,10)
AI = np.linalg.inv(A)
#check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print(np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :]))
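
The same stacking works for np.linalg.solve, so if the A matrices and right-hand sides for the whole grid can be generated up front, every system is solved in a single call. A minimal sketch with made-up shapes (the [..., None] just gives each right-hand side an explicit column so the broadcasting is unambiguous):

A = np.random.randn(4, 4, 10, 10)              # one 10x10 system matrix per grid point
B = np.random.randn(4, 4, 10)                  # one right-hand side per grid point
X = np.linalg.solve(A, B[..., None])[..., 0]   # shape (4, 4, 10): all systems solved at once
# verify against a single explicitly solved system
print(np.allclose(X[0, 0], np.linalg.solve(A[0, 0], B[0, 0])))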
Eelco Hoogendoorn
  • Yes, this is the best approach here. My answer addresses a more general case, where the A array is not fixed for all loop elements and therefore cannot be inverted in advance. – rth Jul 23 '15 at 15:07
  • Yes, @rth has got it right, matrix A changes. I have to generate a new one for every coordinate point. My own attempts to vectorize generation+solution have failed. So the question is, can I generate an array of A matrices (and corresponding B vectors) and solve **Ax=B** instead of looping? – yevgeniy Jul 23 '15 at 15:58
  • Yes, the same properties which apply to np.linalg.inv apply to np.linalg.solve. If you need to know the specifics of how to vectorize those matrices, you need to provide more detail in your question, since right now the formulation of your question contains a single static matrix. – Eelco Hoogendoorn Jul 23 '15 at 23:50

@yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy is a bit tricky (assuming an A array that changes for every iteration).

For instance, here is a benchmark for solving 1000 systems of the form A x = b, where A is a 10x10 matrix and b a 10-element vector. Surprisingly, the approach of putting all of this into one block-diagonal matrix and calling scipy.linalg.solve once is indeed slower, with both dense and sparse matrices.

import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve

N = 10
M = 1000 # number of coordinates 
Ai = np.random.randn(N, N) # we could precompute the inverse here,
# but let's assume that Ai is a different matrix at each iteration of the for loop
bi = np.random.randn(N)

%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop

Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)

%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)

# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop

Afull = block_diag(*[Ai]*M) 

%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)

# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop

Solving the linear system with sparse arrays is faster, but the time to create the block-diagonal array is actually very slow. As for dense arrays, they are simply slower in this case (and take lots of RAM).

Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things you could do for optimization.

Staying in pure Python, look at the source code of scipy.linalg.solve: remove the unnecessary tests and factor all repeated operations out of your loops. For instance, assuming your matrices are not symmetric positive definite, we could do

from numpy.linalg import LinAlgError
from scipy.linalg import get_lapack_funcs

# fetch the LAPACK routine once, outside of the loop
gesv, = get_lapack_funcs(('gesv',), (Ai, bi))

def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and b is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)

%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]

# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop

which results in roughly an 8x speed-up (30.1 ms vs 3.77 ms for the 1000 solves).

If you need even better performance, you would have to port this for loop to Cython and interface the gesv LAPACK function directly in C, as discussed here, or better, use the Cython API for BLAS/LAPACK added in SciPy 0.16.

Edit: As @Eelco Hoogendoorn mentioned, if your A matrix is fixed, there is a much simpler and more efficient approach.
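
For completeness, a minimal sketch of that fixed-A case (the names Bfull/Xfull are just illustrative): factorize Ai once with lu_factor and solve all right-hand sides in a single lu_solve call.

from scipy.linalg import lu_factor, lu_solve

lu_piv = lu_factor(Ai)                 # factorize the fixed matrix once
Bfull = np.random.randn(M, N)          # one right-hand side per coordinate
Xfull = lu_solve(lu_piv, Bfull.T).T    # all M solutions in a single call, shape (M, N)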

rth
  • It appears `scipy.sparse.block_diag` takes a very roundabout and general approach and is therefore too slow. Sparse arrays can beat the Python loop by constructing the `data`, `indices` and `indptr` arrays directly. – Jul 23 '15 at 15:51

In addition to @rth's part on using the LAPACK functions directly, you can use Numba.

import numba
import numpy as np

@numba.njit()
def numba_solve(A, b):
    return np.linalg.solve(A, b)

# first execution triggers compilation; otherwise compiling is included in the execution time
# (Ai and bi are lists holding one matrix and one right-hand side per coordinate here)
numba_solve(Ai[0], bi[0])

Here we use the numba library, which accelerates Python functions by compiling everything it can using Just-In-Time (JIT) compilation. Similar to solve_opt by rth, it vastly accelerates the call to numpy.linalg.solve() and is the fastest method I tested (see the table at the end).
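
A rough sketch of how the per-solve averages below could be measured (assuming, as above, that Ai and bi are lists of M matrices and right-hand sides):

import timeit

runs = 100
t = timeit.timeit(lambda: [numba_solve(A, b) for A, b in zip(Ai, bi)], number=runs)
print(t / (runs * len(Ai)) * 1e6, "microseconds per solve")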

Addition: If your matrix is singular (or can be) and you want the least-squares solution, scipy.linalg.lstsq is quite fast. Numba or calling the LAPACK functions directly was not much faster.
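
A minimal sketch of that variant, using the gelsy driver that came out fastest among the lstsq calls in the table below (A_sing is just an illustrative, deliberately rank-deficient matrix):

from scipy.linalg import lstsq

A_sing = np.random.randn(10, 10)
A_sing[-1] = A_sing[0]       # make the matrix rank-deficient
x, res, rank, sv = lstsq(A_sing, np.random.randn(10), lapack_driver="gelsy")
print(rank)                  # 9: lstsq still returns a least-squares solution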

Here are the results for the suggested ideas and a few more:

| avg. exec. time | solver |
| --- | --- |
| 2.4 microseconds | `numba_solve(A, b)` |
| 3.6 microseconds | `solve_opt(A, b, gesv=gesv)` |
| 4.8 microseconds | `np.linalg.solve(A, b)` |
| 13.5 microseconds | sparse blockdiag solver (by rth) |
| 21.3 microseconds | `scipy.linalg.lstsq(A, b, lapack_driver="gelsy")[0]` |
| 22.1 microseconds | `numba_lstsq(A, b)` |
| 23.3 microseconds | `lstsq_opt(A, b, gelss=gelss)` |
| 24.2 microseconds | `np.linalg.lstsq(A, b, rcond=-1)[0]` |
| 24.2 microseconds | `scipy.linalg.solve(A, b)` |
| 26.1 microseconds | `np.linalg.lstsq(A, b, rcond=None)[0]` |
| 41.3 microseconds | `scipy.linalg.lstsq(A, b, lapack_driver="gelss")[0]` |
| 42.6 microseconds | `np.linalg.pinv(A) @ b` |
| 43.7 microseconds | `scipy.linalg.lstsq(A, b, lapack_driver="gelsd")[0]` |
| 61.2 microseconds | `scipy.optimize.root(lambda x: A @ x - b, np.zeros(10), jac=lambda x: A, method='lm', tol=1e-1)` |
| 64.6 microseconds | `scipy.linalg.pinv(A) @ b` |
| 110.9 microseconds | `scipy.optimize.root(lambda x: A @ x - b, np.zeros(10), jac=lambda x: A, method='hybr', tol=1e-1)` |
| 4935.9 microseconds | blockdiag solver (by rth) |

Full source code under: https://gist.github.com/te2be/4a1c4fb4e436ecaa61d74e09788e429f

Edit: One can get even faster by applying Numba to the loop itself. Here you have to use the Numba typed List, as Numba has problems looping efficiently over plain Python lists in this case.

@numba.njit(cache=True)
def loop_numba_solve(Ai, bi):
    # solve every system inside the compiled loop
    # (only the last solution is returned, which is enough for benchmarking)
    for A, b in zip(Ai, bi):
        x = np.linalg.solve(A, b)
    return x

# copy the matrices and right-hand sides into Numba typed lists
nb_Ai = numba.typed.List()
nb_bi = numba.typed.List()
for A, b in zip(Ai, bi):
    nb_Ai.append(A)
    nb_bi.append(b)

loop_numba_solve(nb_Ai, nb_bi)

resulting in:

| avg. exec. time | solver |
| --- | --- |
| 1.4 microseconds | `loop_numba_solve(nb_Ai, nb_bi)` |
te2be