
I am writing a time-consuming program. To reduce the runtime, I have tried my best to use numpy.dot instead of for loops.

However, I found that the vectorized version has much worse performance than the for loop version:

import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583

What is happening here?

ali_m
atbug
  • Shouldn't you be outputting for the loopy version in a different variable than `kpt_data`? Might not affect the runtimes greatly, but for clarity between two versions. – Divakar Feb 23 '16 at 12:59
  • Doesn't change performance much when the variable changes though – Flavian Hautbois Feb 23 '16 at 13:16
  • I really can't find a reason for such a jump. In my opinion, numpy just does not support multi-dimensional arrays of dim > 2 using optimized code, but my best bet is that you should directly ask the numpy developers. – Flavian Hautbois Feb 23 '16 at 13:58
  • Have you tried this with actual values (not zeros and 1s)? Are the results exactly the same? Otherwise, I agree with @FlavianHautbois that the `dot` source for multidimensional arrays is completely different and not optimized from the 1D or 2D cases. – djhoese Feb 23 '16 at 14:56
  • @daveydave400, yeah, I was using actual values and much larger array sizes in my program. For loops are also prominent. My opinion is that `numpy.dot` should at least have the same performance as for loops. This is not what I expected and will confuse me when I choose my algorithms. – atbug Feb 23 '16 at 15:23
  • Check out this guy's code: https://www.huyng.com/posts/faster-numpy-dot-product – Charlie Haley Feb 23 '16 at 15:30
  • Also check out this question: http://stackoverflow.com/questions/19839539/faster-code-then-numpy-dot-for-matrix-multiplication – Charlie Haley Feb 23 '16 at 15:30
  • @CharlieHaley I tried this guy's code but it does not give the right return type (2 dimensional matrix instead of 3 dimensional) – Flavian Hautbois Feb 23 '16 at 15:35

2 Answers


The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:

import numpy as np


def setup(n1=10000, n2=1000, n3=20, seed=None):

    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float64)
    rpt_list = gen.randn(n2, n3).astype(np.float64)
    h_r = (gen.randn(n3, n3, n2) + 1j * gen.randn(n3, n3, n2)).astype(np.complex128)
    r_ndegen = gen.randn(n2).astype(np.float64)

    return kpt_list, rpt_list, h_r, r_ndegen


def original_vec(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)

    return kpt_data


def original_loop(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)

    return kpt_data

I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.


Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:
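
(Note that line_profiler is a separate package; assuming it is installed, you need to load its IPython extension first to get the %lprun magic used below:)

%load_ext line_profiler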

In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s

Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           def original_vec(*args, **kwargs):
    13                                           
    14         1        86498  86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15                                           
    16         1        69700  69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0     93.7      kpt_data = h_r.dot(phase)
    19                                           
    20         1            4      4.0      0.0      return kpt_data

You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
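
(As a quick sanity check, the einsum expression really does describe the same operation as the 3D dot product - using the h_r and phase arrays from above:)

np.allclose(h_r.dot(phase), np.einsum('ijk,kl->ijl', h_r, phase))  # -> True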


The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:

In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop

In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop

I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, which suggests that the 3D case may not be calling the multithreaded BLAS routines at all.
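
(If you want to see which BLAS library your numpy build is linked against - and hence whether a multithreaded implementation is even available - you can inspect the build configuration:)

np.show_config()   # prints the BLAS/LAPACK libraries numpy was compiled against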


Here's a vectorized version that incorporates the reshaping operation:

def new_vec(*args, **kwargs):

    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)

    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)

    return kpt_data.reshape(h_r.shape[:2] + (-1,))

The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
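
(A quick way to convince yourself that the broadcast division is equivalent to the explicit np.tile version - the small a and r arrays here are just made-up stand-ins for phase and r_ndegen:)

a = np.arange(12.).reshape(3, 4)   # stand-in for the (1000, 10000) phase array
r = np.array([1., 2., 4.])         # stand-in for r_ndegen
np.allclose(a / r[:, None], a / np.tile(r.reshape(3, 1), 4))  # -> True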

By using the same random input data, we can check that the new version gives the same result as the original:

In [4]: ans1 = original_loop(seed=0)

In [5]: ans2 = new_vec(seed=0)

In [6]: np.allclose(ans1, ans2)
Out[6]: True

Some performance benchmarks:

In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop

In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop

In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop

Update:

I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and if so immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...

In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
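
(If you end up using this reshape trick a lot, it can be wrapped in a small helper - just a sketch, and the function name is my own, not part of numpy:)

def dot_nd_2d(a, b):
    """Compute np.dot(a, b) for an N-d `a` and a 2D `b` by flattening the
    leading axes of `a`, so the work goes through the fast 2D BLAS path."""
    out = a.reshape(-1, a.shape[-1]).dot(b)
    return out.reshape(a.shape[:-1] + (b.shape[-1],))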


Update 2:

I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:

In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
ali_m
  • Interesting find that one! I was writing a lengthy one with a hybrid (vect + loopy) approach for some minor improvement, but this is much better! Just to give it a bit of further boost, `numexpr` could be incorporated to get phase : `A = 1j * np.dot(rpt_list, kpt_list.T); phase = ne.evaluate("exp(A)")/ r_ndegen[:, None]`. – Divakar Feb 23 '16 at 18:57
  • @ali_m, My opinion is that `np.dot` should have at least the same performance as for loops. Otherwise I have to check every implementation to get a good performance. – atbug Feb 24 '16 at 01:16
  • I agree that `np.dot` could do with improvement, but for me the take-home message is not to make unfounded assumptions about the efficiency of library functions. There will never be any guarantee that `np.dot` is faster than a `for` loop under all circumstances. If performance matters then you should get in the habit of profiling your code and testing different implementations. – ali_m Feb 24 '16 at 01:39
  • Another point I forgot to mention - you could also use `np.tensordot(h_r, phase, axes=1)`, which is just as fast as reshaping to 2D and using `np.dot` (it calls the same underlying BLAS routine), but avoids all the ugly reshaping operations. – ali_m Feb 24 '16 at 01:39

I suspect the first operation is hitting a resource limit. Maybe you can benefit from these two questions: Efficient dot products of large memory-mapped arrays, and Dot product of huge arrays in numpy.
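
(A minimal sketch of the chunked, memory-mapped approach those questions describe - the file names and chunk size here are made up purely for illustration:)

a = np.load('a.npy', mmap_mode='r')   # large (m, k) array, memory-mapped rather than read into RAM
b = np.load('b.npy')                  # smaller (k, n) array that fits in memory
out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))

chunk = 1000                          # number of rows of `a` to process at a time
for start in range(0, a.shape[0], chunk):
    stop = start + chunk
    out[start:stop] = a[start:stop].dot(b)   # only this slice of `a` is read into memory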

hashmuke