
I'm getting some efficiency test results that I can't explain.

I want to assemble an array B whose i-th entry is B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2D matrix, and so is x.

I can do this three ways. To test performance, I make random (numpy.random.randn) arrays A of shape (10, 1000, 1000) and x of shape (1000, 1200). I get the following timing results:

(1) single multidimensional dot product

B = A.dot(x)

total time: 102.361 s

(2) looping through i and performing 2D dot products

   # initialize B = np.zeros([dim1, dim2, dim3])
   for i in range(A.shape[0]):
       B[i,:,:] = A[i,:,:].dot(x)

total time: 0.826 s

(3) numpy.einsum

B3 = np.einsum("ijk, kl -> ijl", A, x)

total time: 8.289 s

So, option (2) is by far the fastest. But considering just (1) and (2), I don't see why there is such a big difference between them. How can looping through and doing 2D dot products be ~124 times faster? They both use numpy.dot. Any insights?

I include the code used for the above results just below:

import numpy as np
import numpy.random as npr
import time

dim1, dim2, dim3 = 10, 1000, 1200
A = npr.randn(dim1, dim2, dim2)
x = npr.randn(dim2, dim3)

# consider three ways of assembling the same matrix B: B1, B2, B3

t = time.time()
B1 = np.dot(A,x)
td1 = time.time() - t
print "a single dot product of A [shape = (%d, %d, %d)] with x [shape = (%d, %d)] completes in %.3f s" \
  % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td1)


B2 = np.zeros([A.shape[0], x.shape[0], x.shape[1]])
t = time.time()
for i in range(A.shape[0]):
    B2[i,:,:] = np.dot(A[i,:,:], x)
td2 = time.time() - t
print "taking %d dot products of 2D dot products A[i,:,:] [shape = (%d, %d)] with x [shape = (%d, %d)] completes in %.3f s" \
  % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td2)

t = time.time()
B3 = np.einsum("ijk, kl -> ijl", A, x)
td3 = time.time() - t
print "using np.einsum, it completes in %.3f s" % td3
albert
Bent Snowman

3 Answers


numpy.dot only delegates to a BLAS matrix multiply when the inputs each have dimension at most 2:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

When you stick your whole 3-dimensional A array into dot, NumPy takes a slower path, going through an nditer object. It still tries to get some use out of BLAS in the slow path, but the way the slow path is designed, it can only use a vector-vector multiply rather than a matrix-matrix multiply, which doesn't give the BLAS anywhere near as much room to optimize.
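
If you want to keep the fast 2-D BLAS path for the stacked case, one workaround (a sketch, not something dot does for you automatically) is to fold the stacked axis into the row axis before the call and reshape afterwards:

import numpy as np

A = np.random.randn(10, 1000, 1000)
x = np.random.randn(1000, 1200)

# collapse the leading "stack" axis so dot sees a plain 2D-by-2D product,
# which hits the cblas_matrixproduct branch quoted above, then restore 3D
B = np.dot(A.reshape(-1, A.shape[2]), x).reshape(A.shape[0], A.shape[1], -1)
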

user2357112

With smaller dims 10,100,200, I get a similar ranking

In [355]: %%timeit
   .....: B=np.zeros((N,M,L))
   .....: for i in range(N):
   .....:     B[i,:,:]=np.dot(A[i,:,:],x)
   .....: 
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop

In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop

In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop

The iterative approach is faster, though not by as much as in your case.

This probably holds as long as the iteration dimension is small compared to the other ones. In that case the overhead of iteration (function calls etc.) is small compared to the calculation time, and doing all the values at once uses more memory.

I tried a dot variation where I reshaped A into 2D, thinking that dot does that kind of reshaping internally. I'm surprised that it is actually the fastest. tensordot is probably doing the same reshaping (its code is Python, so it's readable).


einsum sets up a 'sum of products' iteration involving 4 variables, i, j, k, m - that is, dim1*dim2*dim2*dim3 steps with the C-level nditer. So the more indices you have, the larger the iteration space.
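
For reference, a quick equivalence check (my addition, assuming the same N, M, L = 10, 100, 200 setup as the timings above) that the reshaped dot, tensordot and einsum variants all reproduce the looped result:

import numpy as np

N, M, L = 10, 100, 200
A = np.random.randn(N, M, M)
x = np.random.randn(M, L)

# the explicit loop is the reference result
B_loop = np.zeros((N, M, L))
for i in range(N):
    B_loop[i,:,:] = np.dot(A[i,:,:], x)

print(np.allclose(B_loop, np.dot(A.reshape(-1, M), x).reshape(N, M, L)))
print(np.allclose(B_loop, np.tensordot(A, x, (2, 0))))
print(np.allclose(B_loop, np.einsum('ijk,km->ijm', A, x)))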

hpaulj
  • `dot` does many things under the hood; it is apparent that `np.dot(A,x)` is not calling BLAS and is somehow defaulting to numpy's internal GEMM routine. Your reshape example bypasses the looping mechanics and goes straight to a conventional 2D GEMM call without copying any data, so it will always be the fastest solution for reasonably sized problems given a good BLAS. On my laptop with MKL it's ~50 times faster than einsum for problems of the original size. – Daniel Oct 08 '15 at 20:13
  • `tensordot` is doing the same reshaping. – hpaulj Oct 08 '15 at 21:02
  • Well, tensordot forces a copy of the data internally; I would not recommend this option. – Daniel Oct 09 '15 at 14:13
  • Actually using `(10000, 64, 100)` with `dot` and no reshaping is very slow as well. This genuinely seems like it's not working properly. @Daniel I think you should flesh out your comment to answer, it looks interesting and right. – kabanus Aug 20 '18 at 18:19
  • @kabanus, details like this that affect timings are not part of the API, so can change without notice. Currently I find the `B` loop to be quite a bit faster. `tensordot` is similar. `A@x` is now another option, though not quite as fast. – hpaulj Aug 20 '18 at 18:51

I am not too familiar with numpy's C-API, but numpy.dot is one of the builtin functions that used to live under _dotblas in earlier versions.

Nevertheless, here are my thoughts.

1) numpy.dot takes different paths for 2-dimensional arrays and N-dimensional arrays. From numpy.dot's online documentation:

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b

dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])

So for 2-D arrays you are always guaranteed one call to BLAS's dgemm. For N-D arrays, however, numpy may pick multiplication axes that do not correspond to the fastest-changing axis (as you can see from the excerpt I have posted), and as a result the full power of dgemm can be missed.
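
To make that rule concrete, here is a small sketch (shapes chosen arbitrarily) checking the documented sum-product behaviour against an equivalent einsum:

import numpy as np

a = np.random.rand(2, 3, 4)
b = np.random.rand(5, 4, 6)

d = np.dot(a, b)                        # shape (2, 3, 5, 6)

# dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m]), i.e. a sum over the last
# axis of a and the second-to-last axis of b
ref = np.einsum('ijl,klm->ijkm', a, b)
print(d.shape, np.allclose(d, ref))
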

2) Your A array is too large to be loaded on to CPU cache. In your example, you use A with dimensions (10,1000,1000) which gives

In [1]: A.nbytes
80000000
In [2]: 80000000/1024
78125

That is almost 80MB, much larger than your cache size. So again you lose most of dgemm's power right there.

3) You are also timing the functions somewhat imprecisely. The time function in Python is known to be imprecise. Use timeit instead.
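
For example, a minimal sketch of using the timeit module from a plain script (IPython's %timeit, used below, is the interactive equivalent); the sizes here are just illustrative:

import timeit

setup = """
import numpy as np
A = np.random.rand(20, 20, 20)
x = np.random.rand(20, 20)
"""

# run the statement many times and take the best of a few repeats,
# which is far less noisy than a single time.time() measurement
n = 10000
best = min(timeit.repeat("np.dot(A, x)", setup=setup, repeat=3, number=n)) / n
print("np.dot(A, x): %.1f us per call" % (best * 1e6))
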

So having all the above points in mind, let's try experimenting with arrays that can be loaded on to the cache

dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)

def for_dot1(A,x):
    for i in range(A.shape[0]):
        np.dot(A[i,:,:], x)

def for_dot2(A,x):
    for i in range(A.shape[1]):
        np.dot(A[:,i,:], x)

def for_dot3(A,x):
    for i in range(A.shape[2]):
        np.dot(A[:,:,i], x)

and here are the timings that I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):

In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop

Notice that there is still a time difference, but not by orders of magnitude. Also note the importance of choosing the axis of multiplication. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
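
One factor that plausibly feeds into that axis dependence (my note, not something the timings above prove by themselves) is the memory layout of the slices handed to dot; for a C-contiguous A, only slices along the first axis stay contiguous:

# A is C-contiguous, so only the first-axis slices are contiguous blocks
print(A[0,:,:].flags['C_CONTIGUOUS'])   # True
print(A[:,0,:].flags['C_CONTIGUOUS'])   # False
print(A[:,:,0].flags['C_CONTIGUOUS'])   # False
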

romeric
  • This comes down to your point 1). `time.time()` is valid as long as you are not dealing with functions that take only micro/nanoseconds. In your own example you show the axis argument is only a factor of 2-3, while the timing issues differ by a factor of 1000. Also, be mindful that for a vendor BLAS GEMM (N^3 compute, N^2 data) caching should have relatively little variance. – Daniel Oct 08 '15 at 20:20
  • Yeah, for a vendor BLAS caching has little effect. Thanks, it is clear now that the actual reason for the timing issue is NumPy calling its internal gemm rather than the vendor BLAS one. – romeric Oct 09 '15 at 12:18
  • I think [this](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L986) is the relevant line in the C source. It's called through [this function](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/methods.c#L1991). –  Oct 09 '15 at 14:53