
I have an N-by-M array; for each of its entries, I need to perform some NumPy operations and store the result there.

Right now, I'm doing it the naive way with a double loop:

import numpy as np

N = 10
M = 11
K = 100

result = np.zeros((N, M))

is_relevant = np.random.rand(N, M, K) > 0.5
weight = np.random.rand(3, 3, K)
values1 = np.random.rand(3, 3, K)
values2 = np.random.rand(3, 3, K)

for i in range(N):
    for j in range(M):
        selector = is_relevant[i, j, :]
        result[i, j] = np.sum(
            np.multiply(
                np.multiply(
                    values1[..., selector],
                    values2[..., selector]
                ), weight[..., selector]
            )
        )

Since all the in-loop operations are simply NumPy operations, I think there must be a way to do this faster or loop-free.

Divakar
Sibbs Gambling

2 Answers


We can use a combination of np.einsum and np.tensordot -

a = np.einsum('ijk,ijk,ijk->k',values1,values2,weight)
out = np.tensordot(a,is_relevant,axes=(0,2))

Alternatively, with one einsum call -

np.einsum('ijk,ijk,ijk,lmk->lm',values1,values2,weight,is_relevant)

And with np.dot and einsum -

is_relevant.dot(np.einsum('ijk,ijk,ijk->k',values1,values2,weight))
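All three variants compute the same thing: collapse the elementwise product `values1 * values2 * weight` to one number per `k` slice, then take a weighted sum over `k` using the boolean mask. A quick sanity check against the original double loop (using the question's shapes, and `np.allclose` since these are floating-point results) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 10, 11, 100
is_relevant = rng.random((N, M, K)) > 0.5
weight = rng.random((3, 3, K))
values1 = rng.random((3, 3, K))
values2 = rng.random((3, 3, K))

# Reference: the original double loop from the question
expected = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        sel = is_relevant[i, j, :]
        expected[i, j] = (values1[..., sel] * values2[..., sel] * weight[..., sel]).sum()

# Collapse the per-k products once, then weight by the boolean mask
a = np.einsum('ijk,ijk,ijk->k', values1, values2, weight)
out_tensordot = np.tensordot(a, is_relevant, axes=(0, 2))
out_einsum = np.einsum('ijk,ijk,ijk,lmk->lm', values1, values2, weight, is_relevant)
out_dot = is_relevant.dot(a)

# Floating point: compare with allclose, not exact equality
assert np.allclose(out_tensordot, expected)
assert np.allclose(out_einsum, expected)
assert np.allclose(out_dot, expected)
```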

Also, try the optimize flag in np.einsum by setting it to True, which lets einsum reorder the contractions and dispatch to BLAS where possible.
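To inspect what the optimizer would actually do, np.einsum_path reports the chosen contraction order and estimated FLOP savings for the same subscript string -

```python
import numpy as np

rng = np.random.default_rng(0)
is_relevant = rng.random((10, 11, 100)) > 0.5
weight = rng.random((3, 3, 100))
values1 = rng.random((3, 3, 100))
values2 = rng.random((3, 3, 100))

# Same subscripts as the one-liner; 'greedy' is the default strategy
# behind optimize=True
path, info = np.einsum_path('ijk,ijk,ijk,lmk->lm',
                            values1, values2, weight, is_relevant,
                            optimize='greedy')
print(info)  # human-readable summary of the pairwise contractions
```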

Timings -

In [146]: %%timeit
     ...: a = np.einsum('ijk,ijk,ijk->k',values1,values2,weight)
     ...: out = np.tensordot(a,is_relevant,axes=(0,2))
10000 loops, best of 3: 121 µs per loop

In [147]: %timeit np.einsum('ijk,ijk,ijk,lmk->lm',values1,values2,weight,is_relevant)
1000 loops, best of 3: 851 µs per loop

In [148]: %timeit np.einsum('ijk,ijk,ijk,lmk->lm',values1,values2,weight,is_relevant,optimize=True)
1000 loops, best of 3: 347 µs per loop

In [156]: %timeit is_relevant.dot(np.einsum('ijk,ijk,ijk->k',values1,values2,weight))
10000 loops, best of 3: 58.6 µs per loop

Very large arrays

For very large arrays, we can leverage numexpr to make use of multiple cores -

import numexpr as ne

a = np.einsum('ijk,ijk,ijk->k',values1,values2,weight)
out = np.empty((N, M))
for i in range(N):
    for j in range(M):
        out[i,j] = ne.evaluate('sum(is_relevant_ij*a)',{'is_relevant_ij':is_relevant[i,j], 'a':a})
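If numexpr isn't available, a pure-NumPy sketch of the same idea for huge K is to process the last axis in chunks, so each einsum/dot pair works on a slice that fits in cache; the chunk size here is a tunable assumption, not something from the answer -

```python
import numpy as np

def masked_weighted_sum(values1, values2, weight, is_relevant, chunk=100_000):
    """Chunked variant of is_relevant.dot(a) for very large K.

    Accumulates partial results over slices of the k axis so the
    intermediate per-k product never exceeds `chunk` elements.
    """
    N, M, K = is_relevant.shape
    out = np.zeros((N, M))
    for start in range(0, K, chunk):
        stop = min(start + chunk, K)
        # Per-k products for this slice only
        a = np.einsum('ijk,ijk,ijk->k',
                      values1[..., start:stop],
                      values2[..., start:stop],
                      weight[..., start:stop])
        # Weighted sum over this slice of k, accumulated into out
        out += is_relevant[..., start:stop].dot(a)
    return out
```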
Sibbs Gambling
Divakar
  • `np.all(out == result)` --> `False` – wwii May 17 '19 at 15:01
  • @wwii We are dealing with floating-point numbers, so it's better to use `np.allclose()`. – Divakar May 17 '19 at 15:02
  • Gets me every time. – wwii May 17 '19 at 15:03
  • Wow, that `einsum` one-liner is just gold -- much cleaner and clearer! – Sibbs Gambling May 17 '19 at 16:13
  • When my K is very big (560,000) and N and M are much smaller (around 100), even the `np.dot` and `einsum` method runs as slowly as my double loop. Is there a faster method for such cases (huge K)? – Sibbs Gambling May 17 '19 at 17:22
  • ^ I found my `np.dot` in this case was using only a single core. Investigating why now. I've confirmed my NumPy can use multiple cores with MKL when executing `A = np.random.rand(30000,30000); B = np.dot(A, A)`. – Sibbs Gambling May 17 '19 at 17:29
  • @SibbsGambling Please check out the edit at the end - `Very large arrays`. Would like to know what kind of speedup you get with it. `MKL` should help too. Let me know if you get any boost. – Divakar May 17 '19 at 17:57
  • Thanks! I got a 2x speedup with `numexpr`, but I'm still puzzled as to why my `np.dot` uses just one core in this case (it uses all cores for `A = np.random.rand(30000,30000); B = np.dot(A, A)`!). Should I open another question on this? – Sibbs Gambling May 17 '19 at 18:22
  • @SibbsGambling This might be relevant - https://stackoverflow.com/questions/50295180/? Feel free to ask a new one if that doesn't answer it. – Divakar May 17 '19 at 18:23
  • Tried many other combinations of these approaches and found `numexpr` is really hard to beat, even with parallel threads whose communication overhead is already minimized. Will go with it for large `K`, and the `einsum` one-liner for small `K`. Thanks! – Sibbs Gambling May 17 '19 at 19:22

Another very simple option is just:

result = (values1 * values2 * weight * is_relevant[:, :, np.newaxis, np.newaxis]).sum((2, 3, 4))
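This works by broadcasting: `is_relevant[:, :, np.newaxis, np.newaxis]` has shape `(N, M, 1, 1, K)`, whose two singleton axes line up with the `(3, 3, K)` factors, so the product is a full `(N, M, 3, 3, K)` intermediate that is then summed over axes 2-4. A small shape sketch of just the broadcasting step (one factor stands in for all three):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 10, 11, 100
is_relevant = rng.random((N, M, K)) > 0.5
values1 = rng.random((3, 3, K))

# (N, M, K) -> (N, M, 1, 1, K): the new axes align with values1's (3, 3)
expanded = is_relevant[:, :, np.newaxis, np.newaxis]
assert expanded.shape == (N, M, 1, 1, K)

# Broadcasting materializes the full 5-D intermediate -- this is why the
# one-liner costs more memory than the einsum/dot approaches
product = values1 * expanded
assert product.shape == (N, M, 3, 3, K)
```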

Divakar's last solution is faster than this though. Timings for comparison:

%timeit np.tensordot(np.einsum('ijk,ijk,ijk->k',values1,values2,weight),is_relevant,axes=(0,2))
# 30.9 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.einsum('ijk,ijk,ijk,lmk->lm',values1,values2,weight,is_relevant)
# 379 µs ± 486 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum('ijk,ijk,ijk,lmk->lm',values1,values2,weight,is_relevant,optimize=True)
# 145 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit is_relevant.dot(np.einsum('ijk,ijk,ijk->k',values1,values2,weight))
# 15 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit (values1 * values2 * weight * is_relevant[:, :, np.newaxis, np.newaxis]).sum((2, 3, 4))
# 152 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
jdehesa