It becomes pretty apparent that a = a[a[:, 0].argsort()] is the bottleneck of all the competitive grouping algorithms, big thanks to Vincent J for clarifying this. Over 80% of the processing time is spent on this argsort call, and there is no easy way to replace or optimise it. The numba package can speed up a lot of algorithms, and hopefully argsort will attract some optimisation effort in the future. The remaining part of the grouping, however, can be improved significantly, assuming the IDs in the first column are small integers.
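A rough way to verify this on your own machine (this snippet is my own illustration, not part of the original answer) is to time the sorting step in isolation:

import numpy as np
from timeit import timeit

# 1M rows: small IDs in the first column, arbitrary values in the second
a = np.transpose([np.random.randint(0, 10, size=10**6),
                  np.random.normal(loc=500, scale=200, size=10**6).astype(int)])

# time just the argsort + reordering step
t_sort = timeit(lambda: a[a[:, 0].argsort()], number=10) / 10
print(f"sort step: {t_sort:.4f} s per run")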
TL;DR
The remaining part of the majority of grouping methods contains a call to np.unique, which is quite slow and overkill when the group values are small integers. It's more efficient to replace it with np.bincount, which can then be sped up further with numba.
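The idea in a minimal sketch (my own illustration of the equivalence, assuming non-negative integer IDs):

import numpy as np

ids = np.array([0, 0, 1, 1, 1, 3, 3])           # small non-negative IDs, already sorted
u, counts = np.unique(ids, return_counts=True)  # -> array([0, 1, 3]), array([2, 3, 2])
bins = np.bincount(ids)                         # -> array([2, 3, 0, 2]): one slot per possible ID
nonzero = bins != 0
assert np.array_equal(u, np.flatnonzero(nonzero))
assert np.array_equal(counts, bins[nonzero])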
Here are several ways the remaining part can be improved:
import numpy as np

def _custom_return(unique_id, a, split_idx, return_groups):
    '''Choose if you want to also return unique ids'''
    if return_groups:
        return unique_id, np.split(a[:,1], split_idx)
    else:
        return np.split(a[:,1], split_idx)

def numpy_groupby_index(a, return_groups=False):
    '''Code refactor of method of Vincent J'''
    u, idx = np.unique(a[:,0], return_index=True)
    return _custom_return(u, a, idx[1:], return_groups)

def numpy_groupby_counts(a, return_groups=False):
    '''Use cumsum of counts instead of index'''
    u, counts = np.unique(a[:,0], return_counts=True)
    idx = np.cumsum(counts)
    return _custom_return(u, a, idx[:-1], return_groups)

def numpy_groupby_diff(a, return_groups=False):
    '''No use of any np.unique options'''
    u = np.unique(a[:,0])
    idx = np.flatnonzero(np.diff(a[:,0])) + 1
    return _custom_return(u, a, idx, return_groups)

def numpy_groupby_bins(a, return_groups=False):
    '''Replace np.unique by np.bincount'''
    bins = np.bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

def numba_groupby_bins(a, return_groups=False):
    '''Replace np.bincount by numba_bincount'''
    bins = numba_bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)
So numba_bincount works in the same way as np.bincount, and it's defined like so:
from numba import njit

@njit
def _numba_bincount(a, counts, m):
    # count occurrences of each value of a; counts must be zero-initialised
    for i in range(m):
        counts[a[i]] += 1

def numba_bincount(arr):  # just a refactor of the plain-Python counting loop
    M = np.max(arr)
    counts = np.zeros(M + 1, dtype=int)
    _numba_bincount(arr, counts, len(arr))
    return counts
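As a quick sanity check (my own snippet, not part of the original answer), it should agree with np.bincount on any 1-D array of non-negative integers:

check = np.random.randint(0, 10, size=1000)
assert np.array_equal(numba_bincount(check), np.bincount(check))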
Usage:
a = np.array([[1,275],[1,441],[1,494],[1,593],[2,679],[2,533],[2,686],[3,559],[3,219],[3,455],[4,605],[4,468],[4,692],[4,613]])
a = a[a[:, 0].argsort()]
>>> numpy_groupby_index(a, return_groups=False)
[array([275, 441, 494, 593]),
array([679, 533, 686]),
array([559, 219, 455]),
array([605, 468, 692, 613])]
>>> numpy_groupby_index(a, return_groups=True)
(array([1, 2, 3, 4]),
[array([275, 441, 494, 593]),
array([679, 533, 686]),
array([559, 219, 455]),
array([605, 468, 692, 613])])
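All five variants should return identical groups for the same sorted input; here is a quick check of that (my own addition, not from the original answer):

expected = numpy_groupby_index(a)
for fn in (numpy_groupby_counts, numpy_groupby_diff,
           numpy_groupby_bins, numba_groupby_bins):
    got = fn(a)
    assert len(got) == len(expected)
    assert all(np.array_equal(e, g) for e, g in zip(expected, got))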
Performance tests
It takes ~30 seconds to sort 100M items on my computer (with 10 distinct IDs). Let's test how much time the methods of the remaining part take to run:
%matplotlib inline
import numpy as np
import benchit

benchit.setparams(rep=3)

sizes = [3*10**(i//2) if i%2 else 10**(i//2) for i in range(17)]
N = sizes[-1]
x1 = np.random.randint(0, 10, size=N)
x2 = np.random.normal(loc=500, scale=200, size=N).astype(int)
a = np.transpose([x1, x2])
arr = a[a[:, 0].argsort()]

fns = [numpy_groupby_index, numpy_groupby_counts, numpy_groupby_diff, numpy_groupby_bins, numba_groupby_bins]
in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of events')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

No doubt the numba-powered bincount is the new winner on datasets that contain small IDs. It speeds up the grouping of already-sorted data roughly 5x, although that step accounts for only ~10% of the total runtime.