
I am looking for a NumPy matrix operation that would speed up the following calculation.

I have two 3D arrays A and B. The first dimension indexes the example, and both of them have n_examples examples. What I want is to take the dot product of each pair of corresponding examples in A and B and sum the results:

import numpy as np

n_examples = 10
A = np.random.randn(n_examples, 20,30)
B = np.random.randn(n_examples, 30,5)
sum = np.zeros([20,5])
for i in range(len(A)):
  sum += np.dot(A[i],B[i])
aha

2 Answers


This is a typical application for np.tensordot():

sum = np.tensordot(A, B, [[0,2],[0,1]])
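As a quick sanity check (my own sketch, not part of the original answer), this tensordot() call can be verified against the explicit loop from the question:

```python
import numpy as np

n_examples = 10
A = np.random.randn(n_examples, 20, 30)
B = np.random.randn(n_examples, 30, 5)

# Explicit loop from the question
loop_sum = np.zeros([20, 5])
for i in range(len(A)):
    loop_sum += np.dot(A[i], B[i])

# Contract axis 0 of A with axis 0 of B (the example axis)
# and axis 2 of A with axis 1 of B (the inner-product axis)
td = np.tensordot(A, B, axes=[[0, 2], [0, 1]])

assert np.allclose(loop_sum, td)
```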

Timing

Using the following code:

import numpy as np

n_examples = 100
A = np.random.randn(n_examples, 20,30)
B = np.random.randn(n_examples, 30,5)

def sol1():
    sum = np.zeros([20,5])
    for i in range(len(A)):
      sum += np.dot(A[i],B[i])
    return sum

def sol2():
    return np.array(list(map(np.dot, A, B))).sum(0)  # list() needed on Python 3, where map returns an iterator

def sol3():
    return np.einsum('nmk,nkj->mj',A,B)

def sol4():
    return np.tensordot(A, B, [[2,0],[1,0]])

def sol5():
    return np.tensordot(A, B, [[0,2],[0,1]])

Results:

timeit sol1()
1000 loops, best of 3: 1.46 ms per loop

timeit sol2()
100 loops, best of 3: 4.22 ms per loop

timeit sol3()
1000 loops, best of 3: 1.87 ms per loop

timeit sol4()
10000 loops, best of 3: 205 µs per loop

timeit sol5()
10000 loops, best of 3: 172 µs per loop

On my computer tensordot() was the fastest solution, and changing the order in which the axes are evaluated changed neither the results nor the performance.
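On NumPy 1.10+ the same reduction can also be written with the batched @ (matmul) operator, which broadcasts over the leading example axis; a minimal sketch (my addition, not timed in the original answer):

```python
import numpy as np

n_examples = 100
A = np.random.randn(n_examples, 20, 30)
B = np.random.randn(n_examples, 30, 5)

# A @ B performs one (20,30) x (30,5) matrix product per example,
# giving an (n_examples, 20, 5) stack; summing over axis 0 then
# reproduces the accumulated dot products.
batched = (A @ B).sum(axis=0)

reference = np.tensordot(A, B, [[0, 2], [0, 1]])
assert np.allclose(batched, reference)
```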

Saullo G. P. Castro
  • Thanks for the detailed response! It does produce the fastest solution on my computer as well!. However, if you increase the matrix size (from `20x30`, `30x5` to around `600x300`,`300x10`) then `sol1()` becomes fastest again and is 5x faster than the `tensordot` solution. I wonder why looping in python would be faster than native C implementation such as `tensordot` – aha May 12 '14 at 14:43
  • @aha, that's a surprise for me too, I would expect the `tensordot()` to be faster. Did you also compare `sol4()` and `sol5()`, changing the order in which the axes are evaluated? Perhaps this can make a difference... – Saullo G. P. Castro May 12 '14 at 14:51
  • Using matrix sizes of `600x300`, `300x10`: `sol1()` takes `16.5ms`, `sol4()` takes `113ms`, and `sol5()` takes `89ms` – aha May 12 '14 at 15:23

Ha, it can be done in just one line: np.einsum('nmk,nkj->mj',A,B).

See Einstein summation: http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
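To unpack the subscripts in that one-liner (my own walkthrough, under the shapes from the question): in 'nmk,nkj->mj', the shared index k is contracted (the matrix product), and n, which is absent from the output, is summed over, which is exactly the loop's accumulation.

```python
import numpy as np

n_examples = 10
A = np.random.randn(n_examples, 20, 30)  # subscripts n, m, k
B = np.random.randn(n_examples, 30, 5)   # subscripts n, k, j

# k is contracted (shared between inputs, absent from output),
# n is summed over (absent from output), m and j survive.
result = np.einsum('nmk,nkj->mj', A, B)

# Equivalent explicit loop:
check = np.zeros([20, 5])
for i in range(n_examples):
    check += A[i].dot(B[i])

assert np.allclose(result, check)
```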

Not the same problem, but the idea is much the same; see the discussion and alternative methods in this topic we just discussed: numpy multiply matrices preserve third axis

Don't name your variable sum; it shadows the built-in sum.
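A quick sketch of what goes wrong with that shadowing: once sum is rebound to an array, the built-in function is no longer reachable by that name in the current scope.

```python
import numpy as np

sum = np.zeros([20, 5])  # shadows the built-in sum()
try:
    sum([1, 2, 3])       # the name now refers to an ndarray, not a function
    raised = False
except TypeError:        # 'numpy.ndarray' object is not callable
    raised = True
```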

As @Jaime pointed out, the loop is actually faster for arrays of this size. In fact, a solution based on map and sum, albeit simpler, is even slower:

In [19]:

%%timeit
SUM = np.zeros([20,5])
for i in range(len(A)):
  SUM += np.dot(A[i],B[i])
10000 loops, best of 3: 115 µs per loop
In [20]:

%timeit np.array(map(np.dot, A,B)).sum(0)
1000 loops, best of 3: 445 µs per loop
In [21]:

%timeit np.einsum('nmk,nkj->mj',A,B)
1000 loops, best of 3: 259 µs per loop

Things are different with larger dimensions:

n_examples = 1000
A = np.random.randn(n_examples, 20,1000)
B = np.random.randn(n_examples, 1000,5)

And:

In [46]:

%%timeit
SUM = np.zeros([20,5])
for i in range(len(A)):
  SUM += np.dot(A[i],B[i])
1 loops, best of 3: 191 ms per loop
In [47]:

%timeit np.array(map(np.dot, A,B)).sum(0)
1 loops, best of 3: 164 ms per loop
In [48]:

%timeit np.einsum('nmk,nkj->mj',A,B)
1 loops, best of 3: 451 ms per loop
CT Zhu
  • It is 50% slower than the OP's code for his problem size, and much worse for really large inputs. – Jaime May 10 '14 at 04:27
  • `map` is a slightly faster method for larger dimensions, though not impressively better. `einsum` becomes even slower. Must sleep now and hope to be enlightened by solutions from the west coast. :P – CT Zhu May 10 '14 at 04:45
  • Is it slower than tensordot? – MartianMartian Jan 28 '16 at 06:24
  • Do you happen to know if any other languages' library would provide a more optimized solution? – MartianMartian Jan 28 '16 at 20:34
  • I would suggest you timeit your code to make sure it is indeed slower for the calculation you are doing. See this related discussion: http://stackoverflow.com/questions/18365073/why-is-numpys-einsum-faster-than-numpys-built-in-functions. – CT Zhu Jan 28 '16 at 21:13
  • Also, the `numpy` version is written in `C` (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/einsum.c.src); you might ask the author for suggestions as well. – CT Zhu Jan 28 '16 at 21:20