
Following on from one of my earlier posts, I noticed that the operation `np.add.at(A, indices, B)` is a lot slower than `A[indices] += B`.

import numpy as np

def fast(A):
    # Python loop over rows; accumulate into overlapping slices with +=
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        retval[slice(i, i+n)] += A[i, :]
    return retval

def slow(A):
    # same loop, but accumulate with np.add.at on the same slices
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    for i in range(n):
        np.add.at(retval, slice(i, i+n), A[i, :])
    return retval

def slower(A):
    # no Python loop: a single np.add.at call with a full (n, n) index array
    n = A.shape[0]
    retval = np.zeros(2*n-1)
    indices = np.arange(n)
    indices = indices[:, None] + indices
    np.add.at(retval, indices, A)  # bottleneck here
    return retval
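
As a quick sanity check, all three functions agree (here on a small illustrative input `A_small`):

A_small = np.random.randn(100, 100)
assert np.allclose(fast(A_small), slow(A_small))
assert np.allclose(fast(A_small), slower(A_small))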

My timings:

from timeit import timeit

A = np.random.randn(10000, 10000)

timeit(lambda: fast(A), number=10)    # 0.8852798199995959
timeit(lambda: slow(A), number=10)    # 56.633683917999406
timeit(lambda: slower(A), number=10)  # 57.763389584000834

Clearly, using `__iadd__` (the in-place `+=`) is a lot faster. However, the documentation for `np.add.at` states:

Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
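
That last clause is the crucial difference. A minimal example with duplicate indices:

a = np.zeros(3)
a[[0, 0]] += 1           # buffered: both writes read the same original a[0], so a[0] == 1.0
np.add.at(a, [0, 0], 1)  # unbuffered: both increments accumulate, a[0] is now 3.0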


Why is `np.add.at` so slow?

What is the use case for this function?

Kevin
  • Does this answer your question? [Numpy in-place operation performance](https://stackoverflow.com/questions/57024802) and [Why is a=a*100 almost two times faster than a*=100?](https://stackoverflow.com/questions/67180635). – Jérôme Richard Apr 29 '21 at 13:56
  • @JérômeRichard Not really, I thought `ufunc.at` was equivalent to in-place operations. What you linked is a comparison between `A += B` and `A + B`. I can understand why the latter is slower, because it has to allocate a temporary array. – Kevin Apr 29 '21 at 14:01
  • It would seem as if `np.add.at` uses advanced indexing unnecessarily, which would make a *copy* of the data. Can anyone confirm this theory? – Kevin Apr 29 '21 at 14:23
  • `add.at` is useful when there are duplicates in the indexing array and `+=` does not equal the iterative alternative. That should be covered in the docs. – hpaulj Apr 29 '21 at 15:26
  • OK, thanks, so it can handle duplicate indices because `add.at` is an iterative solution? – Kevin Apr 29 '21 at 15:57
  • Everything in `numpy` iterates, but the fast stuff does so under-the-covers in compiled code. `add.at` makes it clear that `+=` has problems because that solution is buffered: effectively `temp = a[idx] + b` followed by `a[idx] = temp` (see the sketch after these comments). `add.at` does this unbuffered, so duplicate changes can accumulate. It is slower than the buffered form, but still faster than a Python-level iteration. – hpaulj Apr 29 '21 at 16:44
  • @hpaulj I edited my post and added the method `slower`, which I think is a valid use case for `add.at`. As you can see, the Python-level iteration can be removed by using duplicate indices. Although it looks more readable, it seems no faster than the Python-level iteration solutions. Is this because `fast` and `slow` are using `slice`? – Kevin Apr 29 '21 at 17:45
  • I added a test case that's closer to yours, but for a smaller array. Your `slower` might be slowing down due to memory management issues. – hpaulj Apr 30 '21 at 16:40
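
A sketch of the buffered `+=` behaviour hpaulj describes in the comments above (illustrative only, not NumPy's actual implementation):

import numpy as np

a = np.zeros(3)
idx = np.array([0, 0, 1])
b = np.ones(3)

# what a[idx] += b effectively does:
temp = a[idx] + b  # both copies of index 0 read the same original a[0]
a[idx] = temp      # the second write to a[0] overwrites the first
print(a)           # [1. 1. 0.] -- one of the duplicate increments is lost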

2 Answers


`add.at` was intended for cases where `indices` contains duplicates and `+=` does not produce the desired result:

In [44]: A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
In [45]: A[idx]+=1
In [46]: A
Out[46]: array([1, 1, 1, 1, 0])    # the duplicates in idx are ignored

With add.at:

In [47]: A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
In [48]: np.add.at(A, idx, 1)
In [49]: A
Out[49]: array([1, 2, 3, 4, 0])

Same result as with an explicit iteration:

In [50]: A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
In [51]: for i in idx: A[i]+=1
In [52]: A
Out[52]: array([1, 2, 3, 4, 0])

Some timings:

In [53]: %%timeit A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
    ...: A[idx]+=1
3.65 µs ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [54]: %%timeit A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
    ...: np.add.at(A, idx, 1)
6.47 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [55]: %%timeit A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
    ...: np.add.at(A, idx, 1)
    ...: for i in idx: A[i]+=1
15.6 µs ± 41.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

`add.at` is slower than `+=`, but faster than the Python iteration (note that the last cell times both statements together, so the loop alone takes roughly 15.6 − 6.5 ≈ 9 µs).

We could test variants such as `A[:4] + 1`, `A[:4] += 1`, etc.

Anyway, don't use the `add.at` variant if you don't need it.
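
A quick way to check whether plain `+=` is safe for a given index array (a small sketch):

idx = np.array([0, 1, 1, 2])
needs_add_at = np.unique(idx).size < idx.size  # True: idx contains duplicates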

Edit

Your example, simplified a bit:

In [108]: x = np.zeros(2*10-1)
     ...: for i in range(10):
     ...:     x[i:i+10] += 1
     ...: 
In [109]: x
Out[109]: 
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
        6.,  5.,  4.,  3.,  2.,  1.])

So you are adding values to overlapping slices. We could replace the slices with an index array:

In [110]: x = np.zeros(2*10-1)
     ...: for i in range(10):
     ...:     x[np.arange(i,i+10)] += 1
     ...: 
In [111]: x
Out[111]: 
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
        6.,  5.,  4.,  3.,  2.,  1.])

There's no need for your 'slow' case (`add.at` with slices), because a slice never contains duplicate indices.

Now create all the indexes at once. Plain `+=` does not give the desired result because of buffering:

In [112]: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
In [113]: y=np.zeros(2*10-1)
     ...: y[idx]+=1
In [114]: y
Out[114]: 
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1.])

add.at solves that:

In [115]: y=np.zeros(2*10-1)
     ...: np.add.at(y, idx, 1)
In [116]: y
Out[116]: 
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
        6.,  5.,  4.,  3.,  2.,  1.])

And the full python iteration:

In [117]: y=np.zeros(2*10-1)
     ...: for i in idx: y[i]+=1
In [118]: y
Out[118]: 
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
        6.,  5.,  4.,  3.,  2.,  1.])

Now some timings.

The base line:

In [119]: %%timeit
     ...: x = np.zeros(2*10-1)
     ...: for i in range(10):
     ...:     x[i:i+10] += 1
     ...: 
50.5 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Advanced indexing slows this down some:

In [120]: %%timeit
     ...: x = np.zeros(2*10-1)
     ...: for i in range(10):
     ...:     x[np.arange(i,i+10)] += 1
     ...: 
75.2 µs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

If it worked, one advanced-indexing += would be fastest:

In [121]: %%timeit
     ...: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
     ...: y=np.zeros(2*10-1)
     ...: y[idx]+=1
     ...: 
     ...: 
17.5 µs ± 693 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Full python iteration is about the same as the looped arange case:

In [122]: %%timeit
     ...: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
     ...: y=np.zeros(2*10-1)
     ...: for i in idx: y[i]+=1
     ...: 
     ...: 
76.3 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

add.at is slower than the flat +=, but still better than the base line:

In [123]: %%timeit
     ...: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
     ...: y=np.zeros(2*10-1)
     ...: np.add.at(y, idx,1)
     ...: 
     ...: 
29.4 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In my smaller test, your `slower` does best. But it's possible that it does not scale as well as the base/`fast` approach; your `indices` array is much larger. For very large arrays, iterating on one dimension is often faster because of the reduced memory-management load.
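
As a side note, for a 1-D accumulation like this, `np.bincount` with weights is a standard alternative that accumulates duplicates correctly and typically beats `add.at` by a wide margin. A sketch of your `slower` case:

def fastest(A):
    # bincount sums the weights for every occurrence of each index,
    # so duplicate indices accumulate, just like np.add.at
    n = A.shape[0]
    indices = np.arange(n)
    indices = (indices[:, None] + indices).ravel()
    return np.bincount(indices, weights=A.ravel(), minlength=2*n-1)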

hpaulj
  • Thanks. Yes, it seems to ultimately be a memory-management issue that `np.add.at` has for large arrays. Nice getting that confirmed. – Kevin May 01 '21 at 18:49

I also had issues with `np.add.at` being slow. I ended up writing my own version based on sorting:

def add_at(A, indices, B):
    # sort so that equal indices end up in contiguous runs
    sorted_indices = np.argsort(indices)
    # each unique index and the length of its run
    uniques, run_lengths = np.unique(indices[sorted_indices], return_counts=True)
    for i, length, end in zip(uniques, run_lengths, run_lengths.cumsum()):
        # one vectorized reduction per unique index
        A[i] += B[sorted_indices[end-length:end]].sum(axis=0)

For small arrays this is slower than `np.add.at`, but for large arrays it's 20 times faster or more.

Small benchmark:

n, m, d = 5, 10, 3
A = np.zeros((n, d))
B = np.random.randn(m, d)
indices = np.random.randint(n, size=m)

%timeit np.add.at(A, indices, B)
7.6 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit add_at(A, indices, B)
82.9 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
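
As a sanity check, the sorting version matches `np.add.at` on this setup (using fresh arrays, since both mutate their first argument in place):

A1, A2 = np.zeros((n, d)), np.zeros((n, d))
np.add.at(A1, indices, B)
add_at(A2, indices, B)
assert np.allclose(A1, A2)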

Large benchmark:

n, m, d = 10**3, 10**6, 10**2
...
%timeit np.add.at(A, indices, B)
10.2 s ± 42.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit add_at(A, indices, B)
509 ms ± 50.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There's also a common pattern, which is simple and faster than `np.add.at`, though it's still slower than the sorting approach:

def add_at_ind(A, indices, B):
    # one boolean mask and one vectorized reduction per unique index
    for i in np.unique(indices):
        A[i] += B[indices == i].sum(axis=0)
# Small
%timeit add_at_ind(A, indices, B)
56.1 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Large
%timeit add_at_ind(A, indices, B)
3.3 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thomas Ahle