4

I'm looking for an efficient way to multiply a list of matrices in NumPy. I have a stack of matrices like this:

import numpy as np
a = np.random.randn(1000, 4, 4)

I want to matrix-multiply along the long (first) axis, reducing the 1000 matrices to a single 4x4 matrix. Clearly I can do:

# multiply left to right: res = a[0] @ a[1] @ ... @ a[999]
res = np.identity(4)
for ai in a:
    res = np.matmul(res, ai)

But this is super-slow. Is there a faster way (perhaps using einsum or some other function that I don't fully understand yet)?

benshepherd
  • If I run your code (the calculation part) 1000 times, it takes 1.2 seconds; is that slow? Can you explain how fast it should be? What is your benchmark? – poppie Mar 15 '17 at 11:41
  • I'm just aware that Python list operations are generally slow compared to doing things in Numpy. – benshepherd Mar 15 '17 at 11:52
  • Owing to the dependency, I would say stick to the loopy code. If you really want to parallelize, you could use something like multiprocessing (I don't have much experience with it) and compute `np.matmul(a[i], a[i+1])` as process-1, `np.matmul(a[i+2], a[i+3])` as process-2 and so on, and finally merge back (see the sketch after this comment thread). – Divakar Mar 15 '17 at 11:57
  • Sorry I'm confused - your example code uses only numpy (arrays, not matrices), no lists - so there should be no problem with slow lists? – poppie Mar 15 '17 at 12:04
  • Yeah, that's not clear on my part. I meant 'list' in the more general sense. I have a 'list' of '4x4 matrices', but that's just semantics - it's really a numpy array of shape (1000, 4, 4). I presume looping through it is still slow though. – benshepherd Mar 15 '17 at 12:07
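
A minimal sketch of Divakar's pairing idea, assuming the standard multiprocessing.Pool and an even-length stack (the helper name `pair_products` is hypothetical):

import numpy as np
from multiprocessing import Pool

def pair_products(chunk):
    # Multiply neighbouring pairs within a chunk; the order inside each
    # pair is preserved, which matters since matmul is not commutative.
    return np.matmul(chunk[::2], chunk[1::2])

if __name__ == "__main__":
    a = np.random.randn(1000, 4, 4)
    with Pool(processes=2) as pool:
        # each process halves its own contiguous chunk of the stack
        halves = pool.map(pair_products, [a[:500], a[500:]])
    partial = np.concatenate(halves)   # 500 matrices remain, in order
    res = np.identity(4)
    for ai in partial:                 # merge back with the original loop
        res = np.matmul(res, ai)

Whether this pays off depends on the process start-up and pickling overhead; for 4x4 matrices the per-multiplication work is tiny.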

2 Answers

4

A solution that requires only log_2(n) loop iterations for stacks whose size is a power of 2 could be

while len(a) > 1:
    # multiply each even-indexed matrix with its odd-indexed neighbour,
    # halving the stack each pass (requires a power-of-2 length)
    a = np.matmul(a[::2, ...], a[1::2, ...])

which iteratively multiplies neighbouring matrices together until only one matrix is left, halving the number of remaining matrices in each pass.

res = A * B * C * D * ...         # 1024 matrices remaining

becomes

res = (A * B) * (C * D) * ...     # 512 matrices remaining

becomes

res = ((A * B) * (C * D)) * ...   # 256 matrices remaining

etc.

For sizes that are not powers of 2, you can do this for the first 2^n matrices and use your original loop for the remaining ones; a variant that handles any length directly is sketched below.
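
A minimal sketch of that variant (the helper name `chain_matmul` is hypothetical): whenever a pass has an odd number of matrices, the last one is carried over unpaired into the next pass.

import numpy as np

def chain_matmul(a):
    # Pairwise reduction of a stack of matrices of any length.
    a = np.asarray(a)
    while len(a) > 1:
        if len(a) % 2:
            # odd length: pair up all but the last matrix and carry
            # the last one unmultiplied into the next pass
            a = np.concatenate([np.matmul(a[:-1:2], a[1::2]), a[-1:]])
        else:
            a = np.matmul(a[::2], a[1::2])
    return a[0]

The result should agree with the sequential loop up to floating-point rounding, e.g. `np.allclose(chain_matmul(a), res)`.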

Nils Werner
  • Although, looking at the execution time, it seems that this doesn't improve computation speed. :-) – Nils Werner Mar 15 '17 at 13:37
  • On my machine, it does increase the speed: `0.005s vs. 0.0008s` for `a = np.random.randn(1024, 4, 4)` compared to the method given in the question. Also `np.allclose` gives `True`. What makes you suspect performance problems or numerical instability in your solution? – Michael H. Mar 15 '17 at 14:49
  • If I increase the number of matrices to e.g. `2048, 4, 4`, I only get `nan`s. – Nils Werner Mar 15 '17 at 14:52
  • You are right. The iterative method from the question then also yields `NaN`. Though this might have something to do with the huge numbers (`10**257`) already occurring in the 1024 case, and not necessarily with your code. – Michael H. Mar 15 '17 at 15:00
  • I am using OSX; maybe the non-speedup I am seeing has to do with the Accelerate BLAS library being used (it is, among other things, incapable of using multiprocessing). – Nils Werner Mar 16 '17 at 10:23
  • @Michael are you absolutely sure it computes faster? I can't reliably reproduce a speedup on OSX and Linux – Nils Werner Mar 16 '17 at 11:26
  • The times I commented before were computed with differences of `time.clock()`, without the time for getting the random array. Now with `%timeit` I got, at size 1024 and including generation of the random array, 2.82 ms vs. 1.6 ms, so still faster. The generation of the random array alone takes 834 µs (also using IPython's `%timeit`). I use Python 3.4.3 64Bit, NumPy 1.11.2 on Win10 64Bit with an AMD FX-4300. – Michael H. Mar 16 '17 at 11:53
2

np.linalg.multi_dot does this sort of chaining.

In [119]: a = np.random.randn(5, 4, 4)
In [120]: res = np.identity(4)
In [121]: for ai in a: res = np.matmul(res, ai)
In [122]: res
Out[122]: 
array([[ -1.04341835,  -1.22015464,   9.21459712,   0.97214725],
       [ -0.13652679,   0.61012689,  -0.07325689,  -0.17834132],
       [ -2.45684401,  -1.76347514,  12.41094524,   1.00411347],
       [ -8.36738671,  -6.5010718 ,  15.32489832,   3.62426123]])
In [123]: np.linalg.multi_dot(a)
Out[123]: 
array([[ -1.04341835,  -1.22015464,   9.21459712,   0.97214725],
       [ -0.13652679,   0.61012689,  -0.07325689,  -0.17834132],
       [ -2.45684401,  -1.76347514,  12.41094524,   1.00411347],
       [ -8.36738671,  -6.5010718 ,  15.32489832,   3.62426123]])

But it is slower: 92.3 µs per loop vs. 22.2 µs per loop for your original loop. And for your 1000-item case, the test timing is still running.

After figuring out an 'optimal order' (finding it is itself an O(n^3) dynamic program in the number of matrices), multi_dot does a recursive dot:

def _multi_dot(arrays, order, i, j):
    """Actually do the multiplication with the given order."""
    if i == j:
        return arrays[i]
    else:
        return dot(_multi_dot(arrays, order, i, order[i, j]),
                   _multi_dot(arrays, order, order[i, j] + 1, j))

In the 1000-item case this hits a recursion depth error.
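
If the only obstacle is Python's default recursion limit (commonly around 1000 frames), raising it is a possible, if blunt, workaround; a sketch under that assumption:

import sys
import numpy as np

a = np.random.randn(1000, 4, 4)

# With equal-shaped matrices every parenthesization has the same cost,
# and the recursive helper can descend roughly one frame per matrix,
# overrunning the default recursion limit. Raising the limit is a
# workaround, not a recommendation:
sys.setrecursionlimit(10000)
res = np.linalg.multi_dot(a)

Note that for 1000 matrices the order-finding step alone is expensive, so the pairwise-reduction answer above is likely the better route.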

hpaulj