
Say I have a num_indices * n matrix indices (with entries in range(m)) and a num_indices * n matrix values, i.e.,

import time
import numpy as np

m, n = 100, 50
num_indices = 100000
indices = np.random.randint(0, m, (num_indices, n))
values = np.random.uniform(0, 1, (num_indices, n))

I want to get a result matrix of shape m * n, such that each column of the result is accumulated from the corresponding columns of the indices and values matrices; that is, res[indices[i, j], j] += values[i, j] for every i, j.
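For clarity, here is a tiny worked example (the toy input is my own illustration, not part of the benchmark):

# Toy illustration with m, n = 3, 2 to pin down the semantics
toy_indices = np.array([[0, 1],
                        [2, 1]])
toy_values = np.array([[10., 1.],
                       [20., 2.]])
toy_res = np.zeros((3, 2))
for i in range(2):
    for j in range(2):
        toy_res[toy_indices[i, j], j] += toy_values[i, j]
# toy_res == [[10., 0.], [0., 3.], [20., 0.]]:
# column 0 scatters 10 into row 0 and 20 into row 2;
# column 1 sends both 1 and 2 to row 1, which sum to 3.

Here are some of the implementations: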

  • The basic for loop
tic = time.time()
res1 = np.zeros((m, n))
for i in range(num_indices):
    for j in range(n):
        res1[indices[i, j], j] += values[i, j]
print(f'element-wise for loop used {time.time() - tic:.5f}s')
  • np.add.at, multidimensional
tic = time.time()
res2 = np.zeros((m, n))
np.add.at(res2, (indices, np.arange(n)[None, :]), values)
print(f'np.add.at used {time.time() - tic:.5f}s')
  • np.add.at, with a for loop
tic = time.time()
reslst = []
for colid in range(n):
    rescol = np.zeros((m, 1))
    np.add.at(rescol, (indices[:, colid], np.zeros(num_indices, dtype=int)), values[:, colid])
    reslst.append(rescol)
res3 = np.hstack(reslst)
print(f'np.add.at with loop used {time.time() - tic:.5f}s')
  • scipy.sparse, multidimensional
from scipy import sparse
tic = time.time()
res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
print(f'sparse.csr used {time.time() - tic:.5f}s')
  • scipy.sparse, with a for loop
tic = time.time()
reslst = []
for colid in range(n):
    rescol = sparse.csr_matrix((values[:, colid], (indices[:, colid], np.zeros(num_indices, dtype=int))), shape=(m, 1)).toarray()
    reslst.append(rescol)
res5 = np.hstack(reslst)
print(f'sparse.csr with loop used {time.time() - tic:.5f}s')

The results are all the same:

assert np.isclose(res2, res1).all()
assert np.isclose(res3, res1).all()
assert np.isclose(res4, res1).all()
assert np.isclose(res5, res1).all()

However, the timings are odd and not satisfying:

for loop used 1.69889s
np.add.at used 0.17287s
np.add.at with loop used 0.47767s
sparse.csr used 0.16847s
sparse.csr with loop used 0.09845s

My basic questions are:

  • Why is np.add.at so slow - even slower than scipy.sparse? I thought numpy would benefit a lot, especially in the multidimensional case.
  • For scipy.sparse, why is the multidimensional version even slower than the for loop? Isn't concurrency used? (And why, for numpy, is the multidimensional version faster?)
  • Overall, is there any faster way to implement what I want?
graphitump
  • I can't reproduce the `csr` loopy times. For me the loopy code takes about 20% longer – Daniel F Apr 28 '22 at 09:36
  • Most likely answer is that `csr` is more efficient at the `c`-code level due to one of its key use-cases (FEM) wanting fast accumulation of terms. On the other hand, `numpy` `ufunc` methods are more generally applied for dense matrices, and thus have more flexibility and checks built in, which come with a time penalty. – Daniel F Apr 28 '22 at 09:44
  • I'd like to see these tests with `timeit` – hpaulj Apr 28 '22 at 14:43
  • @DanielF, I think the sparse code sorts the `raw` `coo` data to put duplicate indices together, and then does the sums. Mostly it uses `cython` for compiling. In this case there are lots of duplicates. – hpaulj Apr 28 '22 at 14:53
  • What do you mean by "concurrency" in "Isn't concurrency used?"? – Jérôme Richard Apr 28 '22 at 18:59
  • @JérômeRichard I was just curious why the for loop is even faster than writing it in matrix form (with advanced indexing). – graphitump Apr 28 '22 at 20:04

2 Answers


Replicating this to better understand what's happening:

In [48]: m, n = 100, 50
    ...: num_indices = 100000
    ...: indices = np.random.randint(0, m, (num_indices, n))
    ...: values = np.random.uniform(0, 1, (num_indices, n))

The add.at case

In [49]: res2 = np.zeros((m, n))
    ...: np.add.at(res2, (indices, np.arange(n)[None, :]), values)
In [50]: res2
Out[50]: 
array([[472.25600004, 526.36528387, 493.10320355, ..., 548.43850955,
        497.3412013 , 508.51132267],
       ...])
In [51]: %%timeit
    ...: res2 = np.zeros((m, n))
    ...: np.add.at(res2, (indices, np.arange(n)[None, :]), values)
613 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The sparse approach

In [52]: from scipy import sparse
In [53]: res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.
    ...: arange(n), num_indices))), shape=(m, n)).toarray()
In [54]: res4
Out[54]: 
array([[472.25600004, 526.36528387, 493.10320355, ..., 548.43850955,
        497.3412013 , 508.51132267],
        ...])

The values are the same, and there's nothing pathological about them (i.e. not all 0s - just the sums of a lot of floats).

And indeed, the sparse times are a bit better:

In [55]: timeit res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.t
    ...: ile(np.arange(n), num_indices))), shape=(m, n)).toarray()
402 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Usually I find that creating a sparse matrix takes longer than making a dense one, especially when the size is only (100,50). And in a sense it does take long here - nearly half a second. But that's short compared to the add.at time.

Replacing the values with 1's, I confirm that on average, each array element has 1000 duplicates (I could have deduced that from the res2 values and their (0,1) random distribution).

In [57]: res5 = sparse.csr_matrix((np.ones_like(indices.ravel()), (indices.ravel
    ...: (), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
In [58]: res5
Out[58]: 
array([[ 971, 1038, 1004, ..., 1056,  988, 1030],
       [1016,  992,  949, ...,  994,  982, 1011],
       [ 984, 1014,  980, ..., 1001, 1053, 1057],
       ...,])

If I recall past sparse code explorations correctly, with these inputs csr_matrix first makes a coo matrix, then sorts the indices lexically - that is, it sorts on rows, and within those on columns. This puts duplicate indices next to each other, which makes summing them easy. Given the large number of duplicates, it shouldn't be surprising that this is relatively efficient.
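
To make the duplicate-summing step concrete, here is a minimal sketch using the public coo_matrix.sum_duplicates method (my illustration of the idea; the actual csr construction path is compiled code):

# Sketch: explicit duplicate summation with the public coo API
coo = sparse.coo_matrix((values.ravel(),
                         (indices.ravel(), np.tile(np.arange(n), num_indices))),
                        shape=(m, n))
coo.sum_duplicates()   # sorts the (row, col) pairs and sums repeated entries in place
assert np.allclose(coo.toarray(), res2)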

The randomness is only in the row indices. The sparse-with-loop version takes advantage of this by iterating over the n (50) columns. Creating each (100,1) sparse matrix is easier; again it just sorts the row indices and does the sums.

In [61]: %%timeit
    ...: reslst = []
    ...: for colid in range(n):
    ...:     rescol = sparse.csr_matrix((values[:, colid], (indices[:, colid], n
    ...: p.zeros(num_indices, dtype=int))), shape=(m, 1)).toarray()
    ...:     reslst.append(rescol)
    ...: res5 = np.hstack(reslst)
258 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

I might add that np.add.at, while faster than the full Python loop, is usually slower than the buffered += that it corrects. (Note that += does not sum over repeated indices, so res10 below is only a timing baseline, not a correct result - see the comments.)

In [72]: %%timeit
    ...: res10 = np.zeros((m,n))
    ...: res10[indices, np.arange(n)] += values
103 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

To put these times in perspective, the time to generate the values array is:

In [75]: timeit values = np.random.uniform(0, 1, (num_indices, n))
97.5 ms ± 90.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

about 1/6 of the add.at time. And the time to create a sparse matrix from that:

In [76]: timeit sparse.csr_matrix(np.random.uniform(0, 1, (num_indices, n)))
379 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So times on the order of 300 ms to perform the indexed addition are not out of the ordinary. Handling (num_indices, n) shaped arrays, in any way, will take time.

For the res5 case, counting the duplicates can be done with bincount:

In [88]: np.array([np.bincount(indices[:,i]) for i in range(n)]).T
Out[88]: 
array([[ 971, 1038, 1004, ..., 1056,  988, 1030],
       [1016,  992,  949, ...,  994,  982, 1011],
       [ 984, 1014,  980, ..., 1001, 1053, 1057],
       ...,])
In [89]: timeit np.array([np.bincount(indices[:,i]) for i in range(n)]).T
64.7 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edit


Interestingly, if I make a coo_matrix instead, times are quite a bit faster:

In [95]: timeit res4 = sparse.coo_matrix((values.ravel(), (indices.ravel(), np.t
    ...: ile(np.arange(n), num_indices))), shape=(m, n)).toarray()
78.3 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It still does the duplicate-summing step, but as part of the toarray() conversion. It just doesn't have to do the extra work of creating the csr matrix.

edit

I just remembered that previous add.at questions on SO have used bincount with weights:

In [105]: res0 = np.array([np.bincount(indices[:,i],weights=values[:,i]) for i i
     ...: n range(n)]).T
In [106]: np.allclose(res2,res0)
Out[106]: True
In [107]: timeit res0 = np.array([np.bincount(indices[:,i],weights=values[:,i])
     ...: for i in range(n)]).T
142 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
hpaulj
  • `+=` doesn't sum over repeated indices, so that doesn't work. `np.allclose(res2, res10)` returns `False` and if you look under the hood `res10` is all values <1 – Daniel F Apr 28 '22 at 16:27
  • That's what I meant by "usually is slower than the buffered += that it corrects". I was making a time comparison, not a value one. `np.add.at` was developed because `+=` does not produce the correct values in the case of duplicate indices. The `add.at` docs clearly discuss that. But the unbuffered addition does have a time penalty. – hpaulj Apr 28 '22 at 16:33
  • Thanks Daniel and hpaulj. The coo_matrix seems to be quite a bit faster (~8 times faster than the np.add.at res2). Just to make sure, is `values` here all ones or uniform floats? – graphitump Apr 28 '22 at 17:46
  • And I was also wondering whether it's possible to use `np.bincount` in my case. – graphitump Apr 28 '22 at 17:47
  • `bincount` with `weights` works – hpaulj Apr 28 '22 at 17:55

Surprisingly, it turns out that np.add.at is very inefficiently implemented in Numpy.

Regarding scipy.sparse, I cannot reproduce the same performance improvement on my machine with Scipy 1.7.1: it is barely faster. Like Numpy's np.add.at, it is far from being fast.

You can implement this operation efficiently using Numba:

import numba as nb
import numpy as np
import time

@nb.njit('(int_[:,::1], float64[:,::1], int_, int_)')
def compute(indices, values, n, m):
    res = np.zeros((m, n), dtype=np.float64)
    # Loop bound taken from the array itself rather than the global num_indices
    for i in range(indices.shape[0]):
        for j in range(n):
            #assert 0 <= indices[i, j] < m
            res[indices[i, j], j] += values[i, j]
    return res

tic = time.time()
result = compute(indices, values, n, m)
print(f'Numba used {time.time() - tic:.5f}s')

Note that the function assumes that indices contains valid values (i.e. no out-of-bounds indices). Otherwise, the function can crash or silently corrupt the program. You can enable the assertion if you are not sure about that, at the expense of a slower computation (about twice as slow on my machine).

Note that this implementation is fast as long as 8 * n * m is small enough for the output array to fit in the L1 cache (generally 16 KB to 64 KB). Otherwise, it can be a bit slower due to the inefficient random access pattern. If the output array does not fit in the L2 cache (generally a few hundred KB), then this will be significantly slower.
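
On the concurrency question: the kernel above is sequential. A column-parallel variant is possible, since each column j writes only to res[:, j] and so there are no cross-thread races on the output. A minimal sketch (my addition, not part of the benchmark below):

import numba as nb
import numpy as np

# Each thread handles whole columns, so writes never collide across threads
@nb.njit(parallel=True)
def compute_parallel(indices, values, n, m):
    res = np.zeros((m, n), dtype=np.float64)
    for j in nb.prange(n):
        for i in range(indices.shape[0]):
            res[indices[i, j], j] += values[i, j]
    return res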

Results

element-wise for loop used 2.45158s
np.add.at used 0.28600s
sparse.csr used 0.19600s
sparse.csr with loop used 0.18900s
Numba used 0.00500s

Thus, Numba is about 40 times faster than the fastest of the other implementations and ~500 times faster than the slowest one. The Numba code is much faster because it is compiled to an optimized native binary with no additional overhead (i.e. bounds checking, temporary arrays, function calls), as opposed to Numpy and Scipy.

Jérôme Richard
  • Yes numba indeed works in the examples shown here. However, I just tried it in my real usage, where the `res` is in shape `(n=15, m=50)`, so `8 * n * m` is even smaller than the example shown here. And `num_indices=300000` - seems similar to the example here. But numba is quite a bit slower than `csr_matrix` now (0.4s vs. 0.7s). Is there anything wrong? – graphitump Apr 28 '22 at 21:08
  • Very strange. I tested this on 2 different machines and got timings less than 0.01s on both. Assuming your target platform uses a mainstream x86-64 processor (> 1GHz) with mainstream memory, and not an embedded system, the timing should not exceed a few dozen ms. Are you sure the Numba code is actually used? If there is an error, then Numba uses a slow pure-Python fallback implementation. Is there any error/warning? Is the function still slow if called twice (Numba uses lazy compilation, though that should not be the issue here)? Are uniformly distributed values used, like in the example? – Jérôme Richard Apr 28 '22 at 21:28
  • I saved the `indices` and `values` and reloaded them in a new script to compare `numba` and `csr` - the result is normal: `numba` uses `109 ms`, `csr` uses `1079 ms`. However, I also profiled them at runtime in my use case: `numba` uses `972 ms`, `csr` uses `1104 ms`. So it seems normal for `csr` - but what happened to `numba`? I have double-checked that the results are all the same. – graphitump Apr 28 '22 at 22:34
  • I guess it may be because the data array is non-contiguous at runtime in my use case? In the separate script, it also works if I change `@nb.njit('(int_[:,:,:],..` to `@nb.njit('(int_[:,:,::1],..`. However, in my use case, if I change to `[::1]`, an error is thrown: `No matching definition for argument type(s) array(int64, 3d, A)`. – graphitump Apr 28 '22 at 22:39
  • `array(int64, 3d, A)` is a 3D non-contiguous array indeed. It can be implicitly converted to a contiguous one (`array(int64, 3d, C)`) if the stride in the last dimension is 1. Reading contiguous arrays is generally much faster since processors are designed to make such accesses fast (more cache-friendly). Making a copy of the two arrays might help (especially to understand where the source of the slowdown is). There are some strategies to speed this up a bit, but large strided accesses tend to be slow and hard to optimize (see [this related post](https://stackoverflow.com/questions/71818324)). – Jérôme Richard Apr 28 '22 at 23:11