
I have a problem where I have to do the following calculation. I wanted to avoid the loop version, so I vectorized it. Why is the loop version actually faster than the vectorized version? Does anybody have an explanation for this?

thx

import numpy as np
from numpy.core.umath_tests import inner1d

num_vertices = 40000
num_pca_dims = 1000
num_vert_coords = 3
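#(n*3)-by-num_pca_dims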
a = np.arange(num_vert_coords * num_vertices * num_pca_dims).reshape((num_pca_dims, num_vertices*num_vert_coords)).T

#n-by-3
norms = np.arange(num_vertices * num_vert_coords).reshape(num_vertices,-1)

#Loop version
def slowversion(a,norms):
    res_list = []
    for c_idx in range(a.shape[1]):
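        # one PCA column, reshaped to n-by-3 so each row pairs with a row of norms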
        curr_col = a[:,c_idx].reshape(-1,3)
        res = inner1d(curr_col, norms)
        res_list.append(res)
    res_list_conc = np.column_stack(res_list)
    return res_list_conc


#Fast version
def fastversion(a,norms):
    a_3 = a.reshape(num_vertices, 3, num_pca_dims)
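    # broadcast norms over the PCA axis and sum over the 3 coordinate components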
    fast_res = np.sum(a_3 * norms[:,:,None], axis=1)
    return fast_res


res_list_conc = slowversion(a,norms)
fast_res = fastversion(a,norms)
assert np.all(res_list_conc == fast_res)
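
For reference, here is a minimal sketch of how the two versions can be timed with the standard-library timeit module (exact numbers will of course depend on the machine):

import timeit

# best of 3 single calls for each version
t_slow = min(timeit.repeat(lambda: slowversion(a, norms), number=1, repeat=3))
t_fast = min(timeit.repeat(lambda: fastversion(a, norms), number=1, repeat=3))
print("loop version:       %.3f s" % t_slow)
print("vectorized version: %.3f s" % t_fast)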
duga3
  • Can you show the timing? – user2357112 Feb 07 '14 at 20:24
  • See http://stackoverflow.com/q/14566564/399704 which seems related – Aaron D Feb 07 '14 at 20:26
  • 40 million elements of double precision = 320 MB. Could it be that your "fast" version is making a copy of the entire thing? Or that it is accessing the elements in a non-optimal way (not taking advantage of cache coherence)? – Floris Feb 07 '14 at 20:26
  • @AaronD - I believe you are right - the answer provided at the link is very relevant; it might almost be considered a duplicate. – Floris Feb 07 '14 at 20:27
  • Note that `np.einsum('ijk,ij->ik',a.reshape(num_vertices, num_vert_coords, num_pca_dims), norms)` might be a little faster than your loop version. – DSM Feb 07 '14 at 20:59

1 Answer


Your "slow code" is likely doing better because inner1d is a single optimized C++ function that can* make use of your BLAS implementation. Lets look at comparable timings for this operation:

np.allclose(inner1d(a[:,0].reshape(-1,3), norms), 
           np.sum(a[:,0].reshape(-1,3)*norms,axis=1))
True

%timeit inner1d(a[:,0].reshape(-1,3), norms)
10000 loops, best of 3: 200 µs per loop

%timeit np.sum(a[:,0].reshape(-1,3)*norms,axis=1)
1000 loops, best of 3: 625 µs per loop

%timeit np.einsum('ij,ij->i',a[:,0].reshape(-1,3), norms)
1000 loops, best of 3: 325 µs per loop

Using `inner1d` is quite a bit faster than the pure numpy operations. Note that `einsum` is also almost twice as fast as the pure numpy expression, and for good reason. Since your loop is not that large and most of the FLOPs are in the inner computations, the savings from the faster inner operation outweigh the cost of looping.

%timeit slowversion(a,norms)
1 loops, best of 3: 991 ms per loop

%timeit fastversion(a,norms)
1 loops, best of 3: 1.28 s per loop

#Thanks to DSM for writing this out
%timeit np.einsum('ijk,ij->ik',a.reshape(num_vertices, num_vert_coords, num_pca_dims), norms)
1 loops, best of 3: 488 ms per loop

Putting this back together, we can see that the "slow version" wins out overall; however, an `einsum` implementation, which is fairly well optimized for this sort of thing, gives a further speed increase.
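
For completeness, a drop-in version along the lines of DSM's comment could look like the sketch below (the name einsum_version is just illustrative, not from the original code):

def einsum_version(a, norms):
    # same n-by-3-by-num_pca_dims view as fastversion, contracted with einsum
    a_3 = a.reshape(num_vertices, num_vert_coords, num_pca_dims)
    return np.einsum('ijk,ij->ik', a_3, norms)

# should match the loop version exactly, since the inputs are integer arrays
assert np.all(slowversion(a, norms) == einsum_version(a, norms))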

*I don't see it right off in the code, but it is clearly threaded.
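
As an aside, a quick way to see which BLAS/LAPACK libraries a given numpy build is linked against is:

import numpy as np
np.show_config()  # prints the BLAS/LAPACK configuration numpy was built with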

Daniel
  • I can't see any evidence that `inner1d` is either threaded, or that it uses BLAS. Based on the [source](https://github.com/numpy/numpy/blob/7b2f20b406d27364c812f7a81a9c901afbd3600c/numpy/core/src/umath/umath_tests.c.src) it just looks like a pretty straightforward C loop. – ali_m Feb 08 '14 at 00:00
  • @ali_m I did not see anything that would indicate that either, hence the note. Regardless, playing with the `inner1d` function on my laptop clearly shows it is threaded, and a link to BLAS would be the most obvious reason. – Daniel Feb 08 '14 at 17:40
  • Hmm, that's odd - in my hands `inner1d` seems to be single-threaded in both numpy 1.7.1 and the current dev version. It definitely doesn't use BLAS (based on the output of `ldd umath_tests.so`). Anyway, I certainly don't doubt the main point of your answer. – ali_m Feb 08 '14 at 22:13
  • @ali_m That is an interesting point, the only other thing I can think of is there is some voodoo in the anaconda accelerate pro code. – Daniel Feb 08 '14 at 23:23
  • I suspect you may be right about this being special anaconda accelerate magic - I've never played around with it myself, but it certainly looks like it's capable of parallelizing deep numpy functions like this. At any rate, I'd be cautious about comparing your timing benchmarks to a 'vanilla' numpy installation. – ali_m Feb 09 '14 at 00:23