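Since this question has a numpy tag, I'll expand on possible ways to solve it in numpy. In general, this is called a group-by problem. There are many ways to do this in numpy, and they fall into two categories:
- index-based: find where each group starts with np.unique (numpy_groupby_index below)
- count-based: compute the size of each group with np.bincount (numpy_groupby_bins below)
The second type of solution won't work in general if the group IDs are large, since np.bincount needs one counter for every integer up to the largest ID, but it is a significant speedup over np.unique when the IDs are small non-negative integers.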
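To make the limitation on large IDs concrete, here is a minimal sketch. The arrays small_ids and large_ids are made-up inputs for illustration, not data from the question. It shows that the result of np.bincount grows with the largest ID, while np.unique only depends on the number of elements:

```python
import numpy as np

# Hypothetical group IDs, chosen only for illustration.
small_ids = np.array([0, 0, 1, 2, 2, 2])
large_ids = np.array([3, 3, 1_000_000])

# bincount returns one counter per integer in 0..max(id).
print(np.bincount(small_ids))        # [2 1 3]
print(np.bincount(large_ids).size)   # 1000001

# np.unique only stores the IDs that actually occur
# (here 3 and 1000000, first seen at positions 0 and 2).
u, idx = np.unique(large_ids, return_index=True)
print(u, idx)
```

With only three elements, the bincount call already allocates over a million counters, which is why that approach only pays off when IDs are small.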
You need to sort your data by the first column before applying any of these methods:
import numpy as np

a = np.array(a)
arr = a[a[:, 0].argsort()]
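If the order of values inside each group matters, a stable sort may be worth considering, since the default sort kind used by argsort is not guaranteed to preserve the relative order of rows with equal IDs. A small sketch, where the input array is a hypothetical example and not the data from the question:

```python
import numpy as np

# Hypothetical unsorted input: column 0 is the group ID, column 1 is the value.
a = np.array([[1, 77], [0, 1], [2, 117], [0, 2], [1, 80], [0, 26]])

# kind='stable' keeps rows with equal IDs in their original relative order.
order = a[:, 0].argsort(kind='stable')
arr = a[order]

print(arr[:, 0])  # [0 0 0 1 1 2]
print(arr[:, 1])  # values in order: 1, 2, 26, 77, 80, 117
```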
Then you can choose your method of grouping and a custom return:
def _custom_return(unique_id, a, split_idx, return_groups):
    '''Choose if you want to also return unique ids'''
    if return_groups:
        return unique_id, np.split(a[:,1], split_idx)
    else:
        return np.split(a[:,1], split_idx)

def numpy_groupby_index(a, return_groups=True):
    '''Code refactor of method of Vincent J'''
    u, idx = np.unique(a[:,0], return_index=True)
    return _custom_return(u, a, idx[1:], return_groups)

def numpy_groupby_bins(a, return_groups=True):
    '''Significant boost of np.unique by np.bincount'''
    bins = np.bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)
numpy_groupby_bins(arr, return_groups=True)
>>> (array([0, 1, 2]),
     [array([ 1, 2, 26, 74]), array([77, 80, 81]), array([117, 118, 119, 120])])

numpy_groupby_bins(arr, return_groups=False)
>>> [array([ 1, 2, 26, 74]), array([77, 80, 81]), array([117, 118, 119, 120])]

numpy_groupby_index(arr, return_groups=True)
>>> (array([0, 1, 2]),
     [array([ 1, 2, 26, 74]), array([77, 80, 81]), array([117, 118, 119, 120])])

numpy_groupby_index(arr, return_groups=False)
>>> [array([ 1, 2, 26, 74]), array([77, 80, 81]), array([117, 118, 119, 120])]
Note that all of these methods rely on np.split, which is based on list.append under the hood, so it is inefficient when you have a large number of small groups. This is because numpy is not designed to work with arrays of different lengths.
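As a rough illustration of where that cost comes from, the splitting step behaves much like the Python-level loop below, which appends one slice per group. This is a simplified sketch and not numpy's actual source; split_by_slicing is a hypothetical helper name, and the values are the ones from the example output above:

```python
import numpy as np

values = np.array([1, 2, 26, 74, 77, 80, 81, 117, 118, 119, 120])
split_idx = np.array([4, 7])  # positions where a new group starts

def split_by_slicing(values, split_idx):
    """Hypothetical helper: one slice and one list.append per group."""
    bounds = [0, *split_idx, len(values)]
    groups = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        groups.append(values[start:stop])
    return groups

groups = split_by_slicing(values, split_idx)
expected = np.split(values, split_idx)
print(all(np.array_equal(g, e) for g, e in zip(groups, expected)))  # True
```

The loop runs once per group, so with many tiny groups the per-iteration Python overhead dominates the actual array work.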
Also note that the output you expect requires one more iteration:
groups = numpy_groupby_index(arr, return_groups=True)
out = [np.r_[key, group] for key, group in zip(*groups)]
out
>>> [array([ 0, 1, 2, 26, 74]),
     array([ 1, 77, 80, 81]),
     array([ 2, 117, 118, 119, 120])]
If you're interested in performant solutions to this problem, you could also read my further analysis of this kind of problem.