16

Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?

import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
    Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
Amro
  • 123,847
  • 25
  • 243
  • 454
katrasnikj
  • 3,151
  • 3
  • 16
  • 27
  • See here: http://stackoverflow.com/questions/211160/python-inverse-of-a-matrix – Robert Harvey Aug 15 '12 at 15:18
  • Also, have you had a look here? http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html – Robert Harvey Aug 15 '12 at 15:20
  • I don't think you understood my question correctly. I'm asking how to invert not one, but many matrices -- an array of matrices efficiently. – katrasnikj Aug 15 '12 at 15:21
  • 4
    Are `for` loops really that slow in Python? – Robert Harvey Aug 15 '12 at 15:24
  • http://stackoverflow.com/q/5972976 – Robert Harvey Aug 15 '12 at 15:28
  • 13
    Inverting a 3x3 matrix using `inv` takes about 51.8 us for me. `for i in range(100): pass` takes 2.89 us, so the loop overhead for each inv is totally negligible. The time to compute a slice is about 1.2 us. I don't think `for` loop speed is a factor here, and only `timeit` data will convince me otherwise. – DSM Aug 15 '12 at 15:30
  • 2
    @DSM -- I think your comment is about as good of an answer as we're gonna get on this one. I think you should put that as an answer (along with an explanation about how `timeit` is your friend, and you should only really worry about optimizing your innermost loop when you have nested loops, etc, etc). – mgilson Aug 15 '12 at 15:38
  • Anyone every tried forming a block diagonal matrix of the 3x3s and using an efficient sparse inversion algorithm for something like this? – DrSkippy Aug 15 '12 at 16:16
  • @DSM Thank you for this explanation. You are correct, the loop takes only a small fraction of the total computation time. I was too quick in posting this question. I should have analysed my problem more thoroughly. – katrasnikj Aug 15 '12 at 18:02

4 Answers4

17

It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, inv(A.shape[0]). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.

Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):

import numpy as np
A = np.random.rand(1000,3,3)
def slow_inverse(A): 
    Ainv = np.zeros_like(A)

    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv

def fast_inverse(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)

    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv

def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)

    return array([np.linalg.solve(x, identity) for x in A])

from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv
# Looking one step deeper, we see that solve performs many sanity checks.  
# Stripping these, we have:
def faster_inverse(A):
    b = np.identity(A.shape[2], dtype=A.dtype)

    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    pivots = zeros(n_eq, np.intc)
    identity  = np.eye(n_eq)
    def lapack_inverse(a):
        b = np.copy(identity)
        pivots = zeros(n_eq, np.intc)
        results = lapack_lite.dgesv(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise LinAlgError('Singular matrix')
        return b

    return array([lapack_inverse(a) for a in A])


%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)

The results are impressive:

20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop

EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.

Carl F.
  • 6,718
  • 3
  • 28
  • 41
  • Does the numpy array have to be contiguous and in a specific order ('C' or 'F') ? – Juh_ Aug 17 '12 at 12:44
  • Very nice. Could you do the same with eig :-) – Juh_ Aug 17 '12 at 12:44
  • @Juh_: I believe the order of the array matters. Use the default. I recently implemented something very similar where I wanted the minimum eigenvalue. Rather than computing it on each array, I looked up the analytical solution for a 2x2 and coded that up. It sped things up hundreds to thousands of times. – Carl F. Aug 17 '12 at 23:17
  • If A is positive definitive, instead of solving for `A,I` isn't it faster to decompose (Cholesky) `A` to `LL*` then solve `L,I` with `solve_triangular` or directly with `linalg.lapack.clapack.dtrtri`, then answer is one matrix multiply away? – dashesy Apr 22 '15 at 18:25
  • `linalg.inv` in scipy seems to use `getri` – dashesy Apr 22 '15 at 19:47
11

A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with matrix indices being last (in other words, arrays of shape (...,M,N,N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly this is by far the best option in terms of performance:

import numpy as np

A = np.random.rand(3,3,1000)

def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)

    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv

def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)

Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, then the result has to be permuted back to shape (M,N,N).

A check and timing results using IPython, on python 3.6 and numpy 1.14.0:

In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True

In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4

Numpy-Blas calls are not always the fastest possibility

On problems where you have to calculate lots of inverses, eigenvalues, dot-products of small 3x3 matrices or similar cases, numpy-MKL which I use can often be outperformed by quite a margin.

This external Blas routines are usually made for problems with larger matrices, for smaller ones you can write out a standard algorithm or take a look at eg. Intel IPP.

Please keep also in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest). For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.

import numpy as np
import numba as nb
import time

@nb.njit(fastmath=True)
def inversion(m):
    minv=np.empty(m.shape,dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
        minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
        minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
        minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
        minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
        minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
        minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
        minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
        minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv

    return minv

#I was to lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    m_TMP=m.reshape(m.shape[0],9)
    minv=inversion(m_TMP)
    return minv.reshape(minv.shape[0],3,3)

#Testing
A = np.random.rand(1000000,3,3)

#Warmup to not measure compilation overhead on the first call
#You may also use @nb.njit(fastmath=True,cache=True) but this has also about 0.2s 
#overhead on fist call

Ainv = inversion_3x3(A)

t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)

t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)

print(np.allclose(Ainv2,Ainv))

Performance

np.linalg.inv: 0.36  s
inversion_3x3: 0.031 s
max9111
  • 6,272
  • 1
  • 16
  • 33
  • 1
    For future readers, I think it is worth mentioning using the flag `parallel=True` and `prange` instead of `range` can significantly speed this up even more on most machines for such a big input matrix. – Jérôme Richard Jul 09 '22 at 21:49
3

For loops are indeed not necessarily much slower than the alternatives and also in this case, it will not help you much. But here is a suggestion:

import numpy as np
A = np.random.rand(100,3,3) #this is to makes it 
                            #possible to index 
                            #the matrices as A[i]
Ainv = np.array(map(np.linalg.inv, A))

Timing this solution vs. your solution yields a small but noticeable difference:

# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop

I tried to use the numpy routine 'vectorize' with the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change, since it utilises the fact that numpy arrays are ordered column-wise and therefor a linear readout of the data is ever so slightly faster this way.

Dikkemik
  • 31
  • 4