
Is there any standard-library or NumPy equivalent of the following function?

def augmented_assignment_sum(iterable, start=0):
    for n in iterable:
        start += n
    return start


While `sum(ITERABLE)` is very elegant, it uses the `+` operator instead of `+=`, which in the case of `np.ndarray` objects may affect performance.
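To make the point concrete (a minimal sketch of the difference, my addition): for ndarrays, `+=` writes into the existing buffer of the left operand, while `+` allocates a new array:

import numpy as np

a = np.zeros(3)
b = np.ones(3)

alias = a
a += b              # in-place: a's buffer is reused
print(alias is a)   # True - no new array was allocated

a = a + b           # binary +: a fresh array is created
print(alias is a)   # False - a is now a different object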

I have tested that my function can be as fast as `sum()` (while an equivalent using `+` is much slower). As it is a pure Python function, I suspect its performance is still handicapped, so I am looking for an alternative:

In [49]: ARRAYS = [np.random.random((1000000)) for _ in range(100)]

In [50]: def not_augmented_assignment_sum(iterable, start=0): 
    ...:     for n in iterable: 
    ...:         start = start + n 
    ...:     return start 
    ...:                                                                                                                                                                                                                                                                       

In [51]: %timeit not_augmented_assignment_sum(ARRAYS)                                                                                                                                                                                                                          
63.6 ms ± 8.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [52]: %timeit sum(ARRAYS)                                                                                                                                                                                                                                                   
31.2 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [53]: %timeit augmented_assignment_sum(ARRAYS)                                                                                                                                                                                                                              
31.2 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [54]: %timeit not_augmented_assignment_sum(ARRAYS)                                                                                                                                                                                                                          
62.5 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [55]: %timeit sum(ARRAYS)                                                                                                                                                                                                                                                   
37 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [56]: %timeit augmented_assignment_sum(ARRAYS)                                                                                                                                                                                                                              
27.7 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I have tried to use `functools.reduce` combined with `operator.iadd`, but its performance is similar:

In [79]: %timeit reduce(iadd, ARRAYS, 0)                                                                                                                                                                                                                                       
33.4 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [80]: %timeit reduce(iadd, ARRAYS, 0)                                                                                                                                                                                                                                       
29.4 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
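One caveat worth noting about this spelling (my observation): the start value `0` is what keeps `reduce` from mutating the input, because `iadd(0, ARRAYS[0])` falls back to `+` and creates a fresh accumulator. Without it, the first array would be modified in place:

from functools import reduce
from operator import iadd
import numpy as np

arrs = [np.ones(3) for _ in range(3)]
reduce(iadd, arrs)       # no start value: arrs[0] is used as the accumulator
print(arrs[0])           # [3. 3. 3.] - the input was mutated

arrs = [np.ones(3) for _ in range(3)]
reduce(iadd, arrs, 0)    # 0 + arrs[0] makes a fresh accumulator first
print(arrs[0])           # [1. 1. 1.] - inputs left untouched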

I am also interested in memory efficiency, and thus prefer augmented assignment, as it does not require the creation of intermediate objects.
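As a side note, the same in-place accumulation can be spelled with the `out=` parameter of NumPy ufuncs, which is essentially what `+=` does for ndarrays; a minimal sketch (the function name is mine):

import numpy as np

def ufunc_out_sum(iterable, start=None):
    it = iter(iterable)
    if start is None:
        start = next(it).copy()       # copy, so the first input is not mutated
    for n in it:
        np.add(start, n, out=start)   # add in place, no temporary array
    return start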

abukaj
  • `np.add.reduce(ARRAYS)` ? – Dani Mesejo Nov 14 '19 at 18:03
  • @DanielMesejo sadly `374 ms ± 83.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` :-( Although it is considerably faster if `ARRAYS` is a 2D array. – abukaj Nov 14 '19 at 18:22
  • There is also [numpy.sum](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html#numpy.sum) – Dani Mesejo Nov 14 '19 at 18:24
  • @DanielMesejo It returns a scalar unless called with `axis=0`. Then it takes `355 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` :-( Internally it uses `np.add.reduce()` (numpy v. 1.15.4) – abukaj Nov 15 '19 at 08:21
  • What about `np.dot(your_array, np.ones(len(your_array)))`? It should defer to BLAS and be reasonably quick (see the sketch after these comments). – user228395 Nov 18 '19 at 22:50
  • In my runs `np.dot` is faster than the methods you presented, but after I installed MKL `np.sum` is fastest. – user228395 Nov 18 '19 at 23:06
  • Can't reproduce. For me `not_augmented_assignment_sum` seems consistently faster and very similar in performance to `sum`. What system are you on? (mine is PC Linux, Python3.6.5, numpy 1.17.0) – Paul Panzer Nov 19 '19 at 02:59
  • How about `np.stack`ing your arrays and `np.sum`ming over the correct axis? Or do you WANT a scalar? (In which case, just stack and normal sum after) – Gloweye Nov 19 '19 at 08:21
  • @PaulPanzer I tried to reproduce and realized I forgot about the definition of `ARRAYS` (edited). My benchmarking system is Ubuntu 18.04 (2xIntel Xeon CPU E5-2609 v2 @ 2.50GHz, 64G RAM), Python 3.6.7 Anaconda, numpy 1.15.4. – abukaj Nov 19 '19 at 08:21
  • @user228395 `ARRAYS` is a sequence of many different arrays. Isn't `np.dot(...)` equivalent to `len(your_array) * your_array`? – abukaj Nov 19 '19 at 08:25
  • @Gloweye `%timeit np.stack(ARRAYS).sum(axis=0)` is `330 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`. – abukaj Nov 19 '19 at 08:30
  • Even fully optimized, I wouldn't expect NumPy to gain much here. You save some allocation and deallocation, but that was never dominating the runtime anyway. It matters more with stuff that behaves like list concatenation, where the outputs get bigger and bigger and `+=` can use an efficient growth scheme. – user2357112 Nov 19 '19 at 08:34
  • I have to state the obvious, if you care about performance, why not use `cython` / `numba` or your own c-extension ? – Uri Goren Nov 24 '19 at 19:12
  • @UriGoren In the case of C_API the main problem is deployment of package for Windows, but right now I am curious, whether there is ready-made solution within current dependencies of my package. – abukaj Nov 24 '19 at 22:53
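For reference, my reading of the `np.dot` suggestion above: stack the terms into a matrix and let BLAS accumulate them as a matrix-vector product with a vector of ones (note that the stacking itself allocates a large temporary):

import numpy as np

ARRAYS = [np.random.random(1_000_000) for _ in range(100)]

# stack the 100 terms into a (100, 1000000) matrix; the product with a
# ones vector then computes the column sums in a single BLAS call
M = np.stack(ARRAYS)
total = np.ones(len(ARRAYS)) @ M   # same result as M.sum(axis=0)

assert np.allclose(total, sum(ARRAYS))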

1 Answer


The answer to the headline question (I hope @Martijn Pieters will forgive my choice of metaphor), straight from the horse's mouth, is: no, there is no such built-in.

If we allow a few lines of code to implement such an equivalent, we get a rather complicated picture, with what is fastest depending very much on operand size:

[Figure: runtime of each summation method relative to plain_sum, plotted against operand size; 100 terms per sum]

The graph shows the timings of the different methods relative to `sum`, as a function of operand size; the number of terms is always 100. `augmented_assignment_sum` starts paying off towards relatively large operand sizes. Using `scipy.linalg.blas.*axpy` looks pretty competitive over most of the range tested, its main drawback being far less easy to use than `sum`.
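To illustrate that usability gap, a single `daxpy` call computes y := a*x + y, updating y in place for contiguous float64 arrays, but the caller has to pick the variant matching the dtype (a minimal sketch):

import numpy as np
from scipy.linalg import blas

x = np.random.randn(4)
y = np.zeros(4)
blas.daxpy(x, y)           # y := 1.0*x + y, performed in place
assert np.allclose(y, x)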

Code:

from simple_benchmark import BenchmarkBuilder
import numpy as np
from scipy.linalg import blas

B = BenchmarkBuilder()

@B.add_function()
def augmented_assignment_sum(iterable, start=0):
    for n in iterable:
        start += n
    return start

@B.add_function()
def not_augmented_assignment_sum(iterable, start=0):
    for n in iterable:
        start = start + n
    return start

@B.add_function()
def plain_sum(iterable, start=0):
    return sum(iterable, start)

@B.add_function()
def blas_sum(iterable, start=None):
    iterable = iter(iterable)
    if start is None:
        try:
            # use a copy of the first term as the accumulator,
            # so the input arrays are not mutated
            start = next(iterable).copy()
        except StopIteration:
            return 0
    try:
        # pick the BLAS axpy variant matching the accumulator's dtype
        f = {np.dtype('float32'): blas.saxpy,
             np.dtype('float64'): blas.daxpy,
             np.dtype('complex64'): blas.caxpy,
             np.dtype('complex128'): blas.zaxpy}[start.dtype]
    except KeyError:
        # unsupported dtype: fall back to double precision
        f = blas.daxpy
        start = start.astype(float)
    for n in iterable:
        f(n, start)  # start := 1.0*n + start, in place
    return start

@B.add_arguments('size of terms')
def argument_provider():
    for exp in range(1,21):
        sz = int(2**exp)
        yield sz,[np.random.randn(sz) for _ in range(100)]

r = B.run()
r.plot(relative_to=plain_sum)

import pylab
pylab.savefig('inplacesum.png')
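As a quick sanity check (my addition), the BLAS variant should agree with the builtin up to floating-point rounding:

xs = [np.random.randn(1000) for _ in range(100)]
assert np.allclose(blas_sum(xs), plain_sum(xs))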
Paul Panzer