
Since my np.dot is accelerated by OpenBLAS and OpenMPI, I am wondering whether it is possible to write the double sum

# for each fixed output index pair (k, l):
for i in range(N):
    for j in range(N):
        B[k, l] += A[i, j, k, l] * X[i, j]

as an inner product. At the moment I am using

B = np.einsum("ijkl,ij->kl", A, X)

but unfortunately it is quite slow and only uses one processor. Any ideas?

Edit: I benchmarked the answers given so far with a simple example; it seems they all perform within the same order of magnitude:

A = np.random.random([200,200,100,100])
X = np.random.random([200,200])
def B1():
    return np.einsum("ijkl,ij->kl", A, X)
def B2():
    return np.tensordot(A, X, [[0, 1], [0, 1]])
def B3():
    shp = A.shape
    return np.dot(X.ravel(), A.reshape(shp[0]*shp[1], -1)).reshape(shp[2], shp[3])

%timeit B1()
%timeit B2()
%timeit B3()

1 loops, best of 3: 300 ms per loop
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 150 ms per loop

Concluding from these results I would choose np.einsum, since its syntax is still the most readable and the other two are faster only by a factor of about 2. I guess the next step would be to externalize the code into C or Fortran.
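
As a quick sanity check (array sizes shrunk arbitrarily, just for the comparison), all three variants agree up to floating-point rounding:

import numpy as np

A = np.random.random([20, 20, 10, 10])
X = np.random.random([20, 20])

b1 = np.einsum("ijkl,ij->kl", A, X)
b2 = np.tensordot(A, X, [[0, 1], [0, 1]])
shp = A.shape
b3 = np.dot(X.ravel(), A.reshape(shp[0] * shp[1], -1)).reshape(shp[2], shp[3])

# all three compute B[k,l] = sum_ij A[i,j,k,l] * X[i,j]
assert np.allclose(b1, b2) and np.allclose(b1, b3)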

  • How slow is it? Are you using any benchmarking on it? It would make sense if you could post the complete code here. The first part of the `for` loop is understandable, but the last line does not help in understanding the complete flow. Please post the complete code. – Pralhad Narsinh Sonar Jun 04 '15 at 13:26
  • Dear Pralhad, this is the complete code; you can choose whatever ndarrays A, X you want. The purpose of this post is not to explain the einsum function, but rather to find an expression in terms of np.dot, since np.einsum is not parallelized. There is no point in giving a benchmark, since I have nothing to compare np.einsum with (I could benchmark the for loop, but since it is native Python code it is several orders of magnitude slower anyway and therefore of no use). – varantir Jun 04 '15 at 13:40
  • `np.einsum` is actually quite optimized in C and uses SSE vectorization, so you won't be able to gain orders of magnitude in speed even with OpenBLAS in single-threaded execution. A speedup by a factor of 2 with the proposed methods seems reasonable. – rth Jun 04 '15 at 14:19
  • Though the runtimes in the EDIT seem reasonable, there was a typo in the `np.dot` approach: `A.reshape(shp[0]*shp[1],1)`, where it must be `-1` instead. – Divakar Jun 04 '15 at 15:05
  • When summing over all dimensions of the second matrix (as you do here), I find that there is little benefit to using `tensordot`. However, `tensordot` shows a huge advantage when you are actually doing a tensor product (i.e. when both matrices have a dimension that is not summed over). Another way to say this: if you can `reshape` the problem into a matrix-vector product, then `einsum` is just as fast; but if you can only `reshape` it into a matrix-matrix product, then `tensordot` will be faster. – Will Martin Jul 23 '15 at 18:29

3 Answers


You can use np.tensordot():

np.tensordot(A, X, [[0, 1], [0, 1]])

which does use multiple cores.
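
A minimal sketch of this on arrays like those in the question: tensordot brings the contracted axes together, reshapes, and then calls the BLAS-backed np.dot, so the thread count is controlled by your BLAS library (e.g. via the OPENBLAS_NUM_THREADS or MKL_NUM_THREADS environment variables), not by NumPy itself.

import numpy as np

A = np.random.random([200, 200, 100, 100])
X = np.random.random([200, 200])

# contract A's axes (0, 1) against X's axes (0, 1) -> shape (100, 100),
# executed as a single BLAS call and therefore multithreaded
B = np.tensordot(A, X, [[0, 1], [0, 1]])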


EDIT: it is interesting to see how np.einsum and np.tensordot scale as the size of the input arrays increases:

In [18]: for n in range(1, 31):
   ....:     A = np.random.rand(n, n+1, n+2, n+3)
   ....:     X = np.random.rand(n, n+1)
   ....:     print(n)
   ....:     %timeit np.einsum('ijkl,ij->kl', A, X)
   ....:     %timeit np.tensordot(A, X, [[0, 1], [0, 1]])
   ....:
1
1000000 loops, best of 3: 1.55 µs per loop
100000 loops, best of 3: 8.36 µs per loop
...
11
100000 loops, best of 3: 15.9 µs per loop
100000 loops, best of 3: 17.2 µs per loop
12
10000 loops, best of 3: 23.6 µs per loop
100000 loops, best of 3: 18.9 µs per loop
...
21
10000 loops, best of 3: 153 µs per loop
10000 loops, best of 3: 44.4 µs per loop

and the advantage of using tensordot for larger arrays becomes clear.

  • How many cores are you using, and which library do you use under NumPy? I have eight cores and use MKL, and I recall getting a much better improvement by switching to tensordot. – Will Martin Jul 01 '15 at 23:50
  • @WillMartin I noticed that my computer uses 4 cores, with MKL NumPy... [they've done some improvements in `einsum` using SIMD programming](http://stackoverflow.com/a/18365665/832621), making it more competitive – Saullo G. P. Castro Jul 02 '15 at 00:25

You can use np.dot like so -

shp = A.shape
B_dot = np.dot(X.ravel(), A.reshape(shp[0]*shp[1], -1)).reshape(shp[2], shp[3])
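
To see why this reshape is valid, here is a small worked example (shapes chosen arbitrarily): a C-ordered reshape maps A[i, j, k, l] to A2[i*shp[1] + j, k*shp[3] + l], so the double sum over (i, j) becomes a single vector-matrix product.

import numpy as np

A = np.random.random([4, 5, 2, 3])
X = np.random.random([4, 5])

shp = A.shape
# rows enumerate the summed (i, j) pairs, columns the output (k, l) pairs
A2 = A.reshape(shp[0] * shp[1], -1)
B_dot = np.dot(X.ravel(), A2).reshape(shp[2], shp[3])

assert np.allclose(B_dot, np.einsum("ijkl,ij->kl", A, X))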

I find that tensordot outperforms einsum by a lot for some operations. I am using NumPy from Anaconda with Accelerate and Intel's Math Kernel Library (MKL) installed. Consider what happens when the second array has an extra dimension that is not summed over:

In [39]: A = np.random.random([200, 200, 100, 100])

In [40]: X = np.random.random([200, 200])

In [41]: Y = np.random.random([200, 200, 100])

A : (200, 200, 100, 100)
X : (200, 200)
Y : (200, 200, 100)

The operations I'm doing here, contracting over the first two axes in both cases, are:

A X ---> (100, 100)
A Y ---> (100, 100, 100)

The A Y operation basically has to do 100 of the A X operations and store each of them. Here is how tensordot performs in this setting:

In [42]: %timeit tensordot(A, X, [(0,1), (0,1)])
1 loops, best of 3: 477 ms per loop

In [43]: %timeit tensordot(A, Y, [(0,1), (0,1)])
1 loops, best of 3: 1.35 s per loop

Stop and think about this for a second. In line [43] we just did 100 times the number of operations, and it only took about 3 times longer. I know MKL makes some fancy use of the CPU cache to avoid transferring data from RAM. I suspect it is reusing blocks of A for the extra 100 arrays of Y.

Using einsum results in something closer to what you would expect, given that we have 100x more operations to do:

In [44]: %timeit einsum('ijkl,ij->kl', A, X)
1 loops, best of 3: 962 ms per loop

In [45]: %timeit einsum('ijkl,ijm->klm', A, Y)
1 loops, best of 3: 1min 45s per loop

It seems that einsum does very well when one of the argument arrays has all of its dimensions summed over. Using tensordot brings huge performance gains when some dimensions are not summed over (analogous to np.outer, but with multi-dimensional arrays).
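
To make the reshape picture concrete, here is a sketch with small, arbitrarily chosen shapes: contracting with X boils down to a matrix-vector product (GEMV), while contracting with Y boils down to a matrix-matrix product (GEMM), which is where a multithreaded BLAS has the most room to shine.

import numpy as np

N, K, L, M = 6, 4, 5, 3
A = np.random.random([N, N, K, L])
X = np.random.random([N, N])
Y = np.random.random([N, N, M])

A2 = A.reshape(N * N, K * L)  # flatten the summed (i, j) axes

# 'ijkl,ij->kl' is a matrix-vector product (GEMV)
B = np.dot(X.ravel(), A2).reshape(K, L)
# 'ijkl,ijm->klm' is a matrix-matrix product (GEMM)
C = np.dot(A2.T, Y.reshape(N * N, M)).reshape(K, L, M)

assert np.allclose(B, np.einsum('ijkl,ij->kl', A, X))
assert np.allclose(C, np.einsum('ijkl,ijm->klm', A, Y))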

Here is another example. For the array operation:

50x1000x1000 X 50x1000x1000 -> 50x50

Using tensordot gives me 6 GFLOPS compared to 0.2 GFLOPS with einsum.

I think an important point is that modern machines should be able to get into the 5-50 GFLOPS range for large arrays. If you count operations and get less than that, check which library you're using.
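
If you want to do the same accounting yourself, here is a rough sketch (counting one multiply and one add per term of the sum; the sustained rate you see depends on your BLAS and hardware):

import numpy as np
from time import perf_counter

# note: the two arrays need roughly 800 MB of RAM at float64
A = np.random.random([50, 1000, 1000])
B = np.random.random([50, 1000, 1000])

t0 = perf_counter()
C = np.tensordot(A, B, [(1, 2), (1, 2)])  # -> (50, 50)
dt = perf_counter() - t0

flops = 2 * 50 * 50 * 1000 * 1000  # one multiply + one add per summed term
print("%.1f GFLOPS" % (flops / dt / 1e9))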
