18

I noticed that, very strangely, np.sum is 10x slower than a hand-written sum.

np.sum with axis:

import numpy as np

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum(axis=1)
%timeit test(p1)

186 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.sum without axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum()
%timeit test(p1)

17.9 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

+:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0] + p1[:,1]
%timeit test(p1)

15.8 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Multiplication:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0]*p1[:,1]
%timeit test(p1)

15.7 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I don't see any reason for this. Any idea why? My numpy version is 1.15.3.

EDIT: with 10000000 rows instead of 10000:

np.sum (with axis): 202 ms (5 x)
np.sum (without axis): 12 ms
+ : 46 ms (1 x)
* : 44.3 ms 

So I guess there is some fixed overhead at play here, to some extent...
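
For reference, a minimal sketch of the larger run (same pattern as the snippets above, just with a bigger array):

p1 = np.random.rand(10000000, 2)
%timeit p1.sum(axis=1)
%timeit p1.sum()
%timeit p1[:,0] + p1[:,1]
%timeit p1[:,0]*p1[:,1]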

beesleep
  • I think you forgot to include the test for native multiplication... – Julien Oct 31 '18 at 23:12
  • A small portion of the overhead is probably related to the pairwise summation implementation. Only a small portion, though - `prod` doesn't do the pairwise thing, as far as I'm aware, and `prod` runs in about 5/6 the time of `sum` in my tests. I *think* NumPy is also using SIMD for the `+` and not the `sum`, but I'm not yet sure. – user2357112 Oct 31 '18 at 23:40
  • your "Multiplication" is doing something different… the others just use `p1` and basically ignore `p2` – Sam Mason Oct 31 '18 at 23:41
  • I believe most of the overhead is a combination of 1) using a general binary reduction implementation for the `sum`, where `+` gets to completely avoid looping in the axis-1 direction, and 2) NumPy not optimizing the outer loop for the `sum` as well as the loop for the `+`. – user2357112 Oct 31 '18 at 23:49
  • [Here's the source](https://github.com/numpy/numpy/blob/v1.15.3/numpy/core/src/umath/loops.c.src#L1662) for the pairwise summation routine. The number of elements summed is small enough that the routine should immediately go into the straightforward non-pairwise loop case, but the code path still seems to have some overhead over the code path used for things like `prod`. As previously stated, this is only a small portion of the overhead relative to `+`. – user2357112 Oct 31 '18 at 23:55
  • @SamMason edited. Now there should be the same number of operations. Still surprising that * is roughly as fast as + – beesleep Oct 31 '18 at 23:56
  • @beesleep Multiplication and addition aren't that different in floating point; if anything, multiplication is a bit easier. It's different for integers, of course. – Cubic Oct 31 '18 at 23:59
  • @Cubic, I didn't know that, thanks ! – beesleep Nov 01 '18 at 00:01
  • @user2357112, the big difference is when using the axis argument... (of course). – beesleep Nov 01 '18 at 00:15
  • That is so weird, mainly because axis should return a memory view of non-contiguous elements (this is hardly optimized code, because you're being very cache-unfriendly here). In fact, I can get a ~10x difference in the performance of the code just by changing `p1 = np.random.rand(10000, 2)` to `p1 = np.random.rand(2, 10000)` and `p1.sum(axis=1)` to `p1.sum(axis=0)`. – Alex Huszagh Nov 01 '18 at 00:24
  • @AlexanderHuszagh, changing the `order` to 'F' (with a copy if needed) also gives the faster times. – hpaulj Nov 01 '18 at 06:21
  • @hpaulj, I'm considering changing the way I store the coordinates from [[x, y], [x, y], ...] to [[x, x, ...], [y, y, ...]], or doing what you suggest – beesleep Nov 01 '18 at 12:01
  • @hpaulj Yeah, using Fortran order would be way faster as well (it all comes down to how the code deals with the memory layout; see the sketch after these comments). – Alex Huszagh Nov 01 '18 at 17:05
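
A minimal sketch reproducing the layout experiments from the two comments above (the variable names are illustrative, and absolute timings depend on the machine):

import numpy as np

points_as_rows = np.random.rand(10000, 2)            # C-contiguous, shape (10000, 2)
points_as_cols = np.random.rand(2, 10000)            # transposed layout, same amount of data
points_f_order = np.asfortranarray(points_as_rows)   # column-major copy of the first array

%timeit points_as_rows.sum(axis=1)   # reduce across the short axis of a C-ordered array
%timeit points_as_cols.sum(axis=0)   # reduce across the short axis of the transposed layout
%timeit points_f_order.sum(axis=1)   # same shape as the original, but column-major storage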

2 Answers

14

The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case a sum) is not a trivial matter:

  • one has to take round-off errors into account, which is why NumPy uses pairwise summation to reduce them (sketched in the code after this list).
  • tiling is important for bigger arrays, as it makes the most of the available cache
  • in order to use the SIMD instructions and out-of-order execution abilities of modern CPUs, one should process multiple rows in parallel
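
A rough pure-Python sketch of the pairwise-summation idea (the function name and block size are illustrative; NumPy's actual pairwise_sum_DOUBLE routine adds unrolling and blocking on top of this):

def pairwise_sum(x, block=128):
    # Small enough: plain left-to-right loop.
    if len(x) <= block:
        total = 0.0
        for v in x:
            total += v
        return total
    # Otherwise: split in half and sum the halves separately, which keeps the
    # intermediate sums of similar magnitude and limits round-off error growth.
    mid = len(x) // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)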

I have discussed the topics above in more detail, for example, here and here.

However, none of this is needed and it is no better than a naive summation when there are only two elements to add - you get the same result, but with much less overhead and faster.

For only 1000 elements, the overhead of calling NumPy functionality is probably higher than actually doing those 1000 additions (or multiplications, for that matter, because on modern CPUs pipelined additions and multiplications have the same cost). As you can see, for 10^4 the running time is only about 2 times higher than for 10^3, a sure sign that overhead plays the bigger role for 10^3. In this answer the impact of overhead and cache misses is investigated in more detail.
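
One way to see this amortization directly (a sketch, not from the original answer; the sizes and formatting are arbitrary):

import numpy as np
from timeit import timeit

# Time the same reduction for growing sizes; the time per call grows much
# slower than the size at first, because the fixed per-call overhead dominates.
for n in (10**3, 10**4, 10**5):
    a = np.random.rand(n, 2)
    t = timeit(lambda: a.sum(axis=1), number=1000) / 1000
    print(f"n={n:>7}: {t * 1e6:8.1f} µs per call")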

Let's take a look at the profiler results to see whether the theory above holds (I use perf):

For a.sum(axis=1):

  17,39%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] reduce_loop
  11,41%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] pairwise_sum_DOUBLE
   9,78%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] npyiter_buffered_reduce_iternext_ite
   9,24%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   4,35%  python   python3.6                                   [.] _PyEval_EvalFrameDefault
   2,17%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] _aligned_strided_to_contig_size8_src
   2,17%  python   python3.6                                   [.] lookdict_unicode_nodummy
   ...

The overhead of reduce_loop and pairwise_sum_DOUBLE dominates.

For a[:,0]+a[:,1]:

   7,24%  python   python3.6                                   [.] _PyEval_EvalF
   5,26%  python   python3.6                                   [.] PyObject_Mall
   3,95%  python   python3.6                                   [.] visit_decref
   3,95%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   2,63%  python   python3.6                                   [.] PyDict_SetDef
   2,63%  python   python3.6                                   [.] _PyTuple_Mayb
   2,63%  python   python3.6                                   [.] collect
   2,63%  python   python3.6                                   [.] fast_function
   2,63%  python   python3.6                                   [.] visit_reachab
   1,97%  python   python3.6                                   [.] _PyObject_Gen

As expected: the Python overhead plays a big role, and only a simple DOUBLE_add is used.


There is less overhead when calling a.sum():

  • for one, reduce_loop isn't called for every row but only once, which means considerably less overhead.
  • no new result array is created, so there is no need to write 1000 doubles to memory.

So it can be expected that a.sum() is faster (despite the fact that 2000 rather than 1000 additions must be made - but, as we have seen, it is mostly about overhead, and the actual work, the additions, is not responsible for the big share of the running time).


Data obtained by running:

perf record python run.py
perf report

and

#run.py
import numpy as np
a=np.random.rand(1000,2)

for _ in range(10000):
  a.sum(axis=1)
  #a[:,0]+a[:,1]

ead
  • Interesting, and slightly surprising to me as well. I would have expected some overhead, but not this much. Almost makes it seem worth it to implement a special-case trivial reduction if the axis size is under some threshold. – Eelco Hoogendoorn Nov 01 '18 at 11:58
  • I agree, it feels surprising. I noticed the same kind of ~10x overhead with linspace compared to arange – beesleep Nov 01 '18 at 12:04
0

Well, for .sum() with the axis argument versus without: with the axis it has to generate an array of floats as long as your input, with an element for each row. This means that it has to call reduce() 10,000 times along axis=1. Without the axis argument it sums every element into a single float, which is just one call to reduce over the flat representation of the array.
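
For context (not part of the original answer): summing along an axis is the add ufunc's reduction along that axis, which can be checked directly:

import numpy as np

p1 = np.random.rand(10000, 2)

# Summing along an axis is the add ufunc's reduction along that axis:
assert np.allclose(p1.sum(axis=1), np.add.reduce(p1, axis=1))

# Without an axis, everything is reduced into a single scalar:
assert np.allclose(p1.sum(), np.add.reduce(p1, axis=None))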

I'm not sure why the manual add function is faster, and I don't feel like digging through the source code, but I think I have a pretty good guess. I believe the overhead comes from having to perform reduce across axis=1 for each row, so 10,000 separate calls to reduce. In the manual add function, the axis split is performed just a single time when setting up the operands of "+", and then the elements of the split columns can be added together in parallel.
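
A quick check of the second half of this guess (a sketch; the names col0/col1 are illustrative): the column selections are strided views rather than copies, and the manual sum is then a single ufunc call over those views:

import numpy as np

p1 = np.random.rand(10000, 2)
col0, col1 = p1[:, 0], p1[:, 1]

# The column selections are strided views into p1's buffer, not copies:
print(col0.base is p1, col1.base is p1)   # True True
print(col0.strides, col1.strides)         # (16,) (16,) - each step skips over the other column

# The manual version is then a single vectorized ufunc call over the two views:
manual = np.add(col0, col1)               # same as col0 + col1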

mevers303
  • Although a good guess, NumPy doesn't actually have to duplicate the array of floats. NumPy has an extensive memory view system, which allows it to store an array with a reference-counted pointer to the original buffer, the start, the length, and the stride. To create an axis or do fancy indexing, no data duplications are **ever** necessary... You can verify this by checking `a = np.arange(100); b = a[::-1]; a.ctypes.data == b.ctypes.data`, which checks if the pointers to the start of the arrays `a` and `b` are the same, and it will print true. – Alex Huszagh Nov 01 '18 at 00:43
  • You can also verify this by the same code, except with setting a value: `a = np.arange(100); b = a[::-1]; a[4] = 5; b[2]`, which will print 5, despite the value normally being 4. – Alex Huszagh Nov 01 '18 at 00:47
  • So the overhead doesn't come from splitting the array, but the addition function can indeed add the arrays in parallel, correct? Which is faster than a ton of calls to reduce. – mevers303 Nov 01 '18 at 00:48
  • Honestly I'd have to check the implementation to be sure, but it would surprise me if NumPy did it that way. Seems extremely inefficient, since it could just double the stride, increment 1 pointer by 8, call a highly optimized C add routine, and call it a day in both cases (obviously, it must do a lot more work, since indexing (especially fancy indexing) and working with potentially multi-dimensional arrays is quite complex). – Alex Huszagh Nov 01 '18 at 00:54
  • Wait... You're actually right. It calls reduce under the hood. Not 10,000 times, but still... Mind blown... https://github.com/numpy/numpy/blob/9d0225b800df3c2b0bffee9960df87b15527509c/numpy/core/src/multiarray/calculation.c#L519 – Alex Huszagh Nov 01 '18 at 01:00