Replicating this to better understand what's happening:
In [48]: m, n = 100, 50
...: num_indices = 100000
...: indices = np.random.randint(0, m, (num_indices, n))
...: values = np.random.uniform(0, 1, (num_indices, n))
The add.at case
In [49]: res2 = np.zeros((m, n))
...: np.add.at(res2, (indices, np.arange(n)[None, :]), values)
In [50]: res2
Out[50]:
array([[472.25600004, 526.36528387, 493.10320355, ..., 548.43850955,
497.3412013 , 508.51132267],
...])
In [51]: %%timeit
...: res2 = np.zeros((m, n))
...: np.add.at(res2, (indices, np.arange(n)[None, :]), values)
613 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The sparse approach
In [52]: from scipy import sparse
In [53]: res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
In [54]: res4
Out[54]:
array([[472.25600004, 526.36528387, 493.10320355, ..., 548.43850955,
497.3412013 , 508.51132267],
...])
The values are the same, and there's nothing pathological about them (i.e. they aren't all 0s; they're just sums of a lot of floats).
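A quick check, not from the original session, that the two results really do agree:

np.allclose(res2, res4)   # expected to be True, per the matching outputs above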
And indeed, the sparse times are a bit better:
In [55]: timeit res4 = sparse.csr_matrix((values.ravel(), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
402 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Usually I find that creating a sparse matrix takes longer than making a dense one, especially when the size is only (100,50). And in a sense it does take long here, around 0.4 s. But that's still short compared to the add.at time.
Replacing the values with 1s, I confirm that on average each array element has about 1000 duplicates. (I could have deduced that from the res2 values and their (0,1) random distribution: 100000 indices spread over 100 rows gives ~1000 hits per cell, and ~1000 uniform values with mean 0.5 sum to ~500.)
In [57]: res5 = sparse.csr_matrix((np.ones_like(indices.ravel()), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
In [58]: res5
Out[58]:
array([[ 971, 1038, 1004, ..., 1056, 988, 1030],
[1016, 992, 949, ..., 994, 982, 1011],
[ 984, 1014, 980, ..., 1001, 1053, 1057],
...,])
If I recall past sparse code explorations correctly, with these inputs csr_matrix first makes a coo matrix, then sorts the indices lexically - that is, it sorts on rows, and within those on columns. This puts duplicate indices next to each other, where they can easily be summed. Given the large number of duplicates, it shouldn't be surprising that this is relatively efficient.
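A small sketch, not from the original session, of that duplicate-summing behavior on a toy matrix with repeated (row, col) pairs:

from scipy import sparse
import numpy as np

data = np.array([1.0, 2.0, 5.0])
row = np.array([0, 0, 1])      # (0,0) appears twice
col = np.array([0, 0, 1])

A = sparse.csr_matrix((data, (row, col)), shape=(2, 2))
A.toarray()
# array([[3., 0.],
#        [0., 5.]])   # the two (0,0) entries were summed during construction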
The randomness is only in the row indices. The sparse_with_loop approach takes advantage of this by iterating over the n (=50) columns. Creating each (100,1) sparse matrix is then easier: again it just sorts on those row indices and does the sums.
In [61]: %%timeit
...: reslst = []
...: for colid in range(n):
...:     rescol = sparse.csr_matrix((values[:, colid],
...:                 (indices[:, colid], np.zeros(num_indices, dtype=int))),
...:                 shape=(m, 1)).toarray()
...:     reslst.append(rescol)
...: res5 = np.hstack(reslst)
258 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I might add that np.add.at, while faster than a full Python loop, is usually slower than the buffered += that it corrects.
In [72]: %%timeit
...: res10 = np.zeros((m,n))
...: res10[indices, np.arange(n)] += values
103 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
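For context, here is a minimal sketch, not from the original session, of the buffering issue that np.add.at corrects: with plain fancy-index +=, duplicate indices only get one of their updates applied.

import numpy as np

a = np.zeros(3)
idx = np.array([0, 0, 1])

a[idx] += 1           # buffered: the two updates to a[0] collapse into one
# a is now [1., 1., 0.]

b = np.zeros(3)
np.add.at(b, idx, 1)  # unbuffered: every duplicate index contributes
# b is now [2., 1., 0.]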
To put these times in perspective, the time to generate the values array is:
In [75]: timeit values = np.random.uniform(0, 1, (num_indices, n))
97.5 ms ± 90.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
about 1/6 of the add.at time. And the time to create a sparse matrix from that array:
In [76]: timeit sparse.csr_matrix(np.random.uniform(0, 1, (num_indices, n)))
379 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So times on the order of 300 ms to perform the indexed addition are not out of the ordinary. Handling (num_indices, n) shaped arrays, in any way, will take time.
For the res5 case, counting the duplicates can also be done with bincount:
In [88]: np.array([np.bincount(indices[:,i]) for i in range(n)]).T
Out[88]:
array([[ 971, 1038, 1004, ..., 1056, 988, 1030],
[1016, 992, 949, ..., 994, 982, 1011],
[ 984, 1014, 980, ..., 1001, 1053, 1057],
...,])
In [89]: timeit np.array([np.bincount(indices[:,i]) for i in range(n)]).T
64.7 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
edit
Interestingly, if I make a coo_matrix instead, the times are quite a bit faster:
In [95]: timeit res4 = sparse.coo_matrix((values.ravel(), (indices.ravel(), np.tile(np.arange(n), num_indices))), shape=(m, n)).toarray()
78.3 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It's still doing the sum-of-duplicates step, but as part of the toarray() step, and it doesn't have to do the extra work of creating the csr matrix.
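Using the same toy data as the earlier sketch, a coo_matrix keeps the duplicate entries as stored and only sums them when densifying (again an illustration, not from the original session):

from scipy import sparse
import numpy as np

data = np.array([1.0, 2.0, 5.0])
row = np.array([0, 0, 1])
col = np.array([0, 0, 1])

C = sparse.coo_matrix((data, (row, col)), shape=(2, 2))
C.data          # array([1., 2., 5.]) -- duplicates still stored separately
C.toarray()
# array([[3., 0.],
#        [0., 5.]])   # summed only when converting to dense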
edit
I just remembered that previous add.at questions on SO have used bincount with a weights argument:
In [105]: res0 = np.array([np.bincount(indices[:,i], weights=values[:,i]) for i in range(n)]).T
In [106]: np.allclose(res2,res0)
Out[106]: True
In [107]: timeit res0 = np.array([np.bincount(indices[:,i], weights=values[:,i]) for i in range(n)]).T
142 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)