140

Is there any function in numpy to group the array below by the first column?

I couldn't find a good answer on the internet.

>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])

Wanted output:

array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)
user
John Dow

11 Answers

103

Inspired by Eelco Hoogendoorn's library, but without using his library, and relying on the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):

>>> np.split(a[:,1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]

I didn't "timeit" ([EDIT] see below) but this is probably the faster way to achieve the question :

  • No native Python loop
  • The resulting lists are numpy arrays, so if you need to perform other numpy operations on them, no further conversion is needed
  • Complexity looks O(n) (with the sort it becomes O(n log n)); a helper combining the sort and the split is sketched below
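For convenience, here is a minimal sketch (my addition, not part of the original answer) that wraps the sort and the split into one hypothetical helper, groupby_first_col, and also returns the group keys:

import numpy as np

def groupby_first_col(a):
    """Group the second column of `a` by the values in the first column.

    Returns (keys, groups): groups[i] holds the values whose key equals
    keys[i]. Fancy indexing makes a sorted copy, so `a` is left untouched.
    """
    a = a[a[:, 0].argsort()]                   # sort rows by the key column
    keys, first_idx = np.unique(a[:, 0], return_index=True)
    groups = np.split(a[:, 1], first_idx[1:])  # cut at each new key
    return keys, groups

keys, groups = groupby_first_col(a)
# keys   -> array([1, 2, 3, 4])
# groups -> [array([275, 441, 494, 593]), array([679, 533, 686]), ...]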

[EDIT Sept 2021] I ran timeit on my MacBook M1, for a table of 10k random integers. The durations shown are for 1000 calls.

>>> a = np.random.randint(5, size=(10000, 2))  # 5 different "groups"

# Only the sort
>>> a = a[a[:, 0].argsort()]
⏱ 116.9 ms

# Group by on the already sorted table
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 35.5 ms

# Total sort + groupby
>>> a = a[a[:, 0].argsort()]
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 153.0 ms 

# With numpy-indexed package (cf Eelco answer)
>>> npi.group_by(a[:, 0]).split(a[:, 1])
⏱ 353.3 ms

# With pandas (cf Piotr answer)
>>> df = pd.DataFrame(a, columns=["key", "val"]) # no timer for this line
>>> df.groupby("key").val.apply(pd.Series.tolist) 
⏱ 362.3 ms

# With defaultdict, the python native way (cf Piotr answer)
>>> d = defaultdict(list)
for key, val in a:
    d[key].append(val)
⏱ 3543.2 ms

# With numpy_groupies (cf Michael answer)
>>> aggregate(a[:,0], a[:,1], "array", fill_value=[])
⏱ 376.4 ms

Second timeit scenario, with 500 different groups instead of 5. I'm surprised about pandas; I ran it several times, but it just behaves badly in this scenario.

>>> a = np.random.randint(500, size=(10000, 2))

just the sort  141.1 ms
already_sorted 392.0 ms
sort+groupby   542.4 ms
pandas        2695.8 ms
numpy-indexed  800.6 ms
defaultdict   3707.3 ms
numpy_groupies 836.7 ms

[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh (cf. comment). Thanks also to TMBailey for noticing that the complexity of argsort is O(n log n).

Vincent J
  • Great answer. And so easy to remember! Well Ok the second one maybe not so much. Adding to my stash of tricks. – WestCoastProjects Dec 19 '20 at 18:41
    Your approach to sort 2d array based one column is not true. use ```a = a[a.T[0,:].argsort()]``` instead. – Behzad Shayegh May 07 '21 at 18:37
  • true! this sort was shuffling second column. I edited the answer. thanks – Vincent J May 10 '21 at 14:42
    If you have to do the sort, won't that bump the complexity up to O(n log n) if the items are not already sorted? – TMBailey Sep 02 '21 at 09:14
  • These examples and time comparisons made me realize Pandas is not such a bad option. – culebrón Nov 13 '21 at 14:22
  • I've put some effort lately in order to estimate if it's possible to create something faster than total sort + groupby. Indeed, total sort (`a = a[a[:, 0].argsort()]`) is an issue. First reason, it's O(n log n). Secondly, it consumes 80% of performing time. It's unlikely `np.argsort` going to be optimised soon by developers of `numpy` or `numba`. Hence, it's possible to speed up remaining 20% only. The main direction of further optimisation is to replace `np.unique` with `np.bincount` which could be improved even more if it's redesigned in `numba`. Remark: bincount won't work on large IDs. – mathfux Jan 30 '22 at 00:53
  • What's more, I wouldn't recommend to use `np.split` within design of OP's code because it creates an empty pythonic `list` and calls `list.append` on each slice of numpyic array [under the hood](https://stackoverflow.com/a/70797158/3044825). – mathfux Jan 30 '22 at 00:55
  • Concerning the efficiency of Pandas, actually, in your results, the major part of the time is due to the `apply` and `tolist` operations. The `groupby` itself is very efficient. I am not able to reproduce exactly your stats, but with the same example, sort + split is about 350µs. Pandas without the list conversion is 213µs. Moreover, the pandas groupby operation seems to be insensitive to N. To get the groups, you can use the `get_group` method of `DataFrameGroupBy`, but this one is slow. [Here are the results](https://github.com/gVallverdu/cookbook/blob/master/numpy_groupby.ipynb). – Ger Mar 28 '23 at 14:44
46

The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.

import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])

Note that it is usually more efficient to compute relevant properties directly over such groups (i.e., group_by(keys).mean(values)) rather than first splitting into a list / jagged array.
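For the array from the question, a minimal sketch of that direct reduction (my addition, not from the original answer; the reductions return the unique keys together with the reduced values, as the comments below also note):

import numpy_indexed as npi

# Per-group mean of the second column, keyed by the first column
keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])
# keys -> array([1, 2, 3, 4]); means -> one mean value per group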

Eelco Hoogendoorn
    Thanks . The way I meant it is that using On2 algorithms is intrinsically painful, even to said algorithm itself. But yeah I guess you have to assume the On2 algorithm is also self aware about it's inferiority for the sentence to make sense.. – Eelco Hoogendoorn Oct 18 '18 at 09:35
    "no `O(n^2)` algorithms were harmed" .. Why would you want to "be nice" to _them_ ? instead _do_ harm them: force them to "get more lean" – WestCoastProjects Sep 10 '19 at 15:59
  • Note that this implementation of `group_by` changes the order of the output groups so that they are sorted by the values of the parameter of the `group_by`. Pandas' `groupby` keeps the original order. – Roland Pihlakas Dec 13 '21 at 02:14
  • In case the group keys are integers so that `len(set(group_keys)) == max(group_keys) + 1 and min(group_keys) == 0` then you can restore the original order by manually indexing the returned array by the `groupby` parameter value again later though. `(_, result) = npi.group_by(group_keys).mean(values[:, :]); result = result[group_keys, :]` – Roland Pihlakas Dec 13 '21 at 02:20
29

Numpy is not very handy here because the desired output is not an array of integers (it is an array of list objects).

I suggest either the pure Python way...

from collections import defaultdict

%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result:
defaultdict(list,
        {1: [275, 441, 494, 593],
         2: [679, 533, 686],
         3: [559, 219, 455],
         4: [605, 468, 692, 613]})

...or the pandas way:

import pandas as pd

%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object
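If plain Python lists keyed by group are wanted from the pandas result (mirroring the defaultdict output above), a possible follow-up, not part of the original answer:

df = pd.DataFrame(a, columns=["key", "val"])
groups = df.groupby("key").val.apply(list).to_dict()
# {1: [275, 441, 494, 593], 2: [679, 533, 686],
#  3: [559, 219, 455], 4: [605, 468, 692, 613]}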
Piotr
15
n = np.unique(a[:,0])
np.array( [ list(a[a[:,0]==i,1]) for i in n] )

outputs:

array([[275, 441, 494, 593], [679, 533, 686], [559, 219, 455],
       [605, 468, 692, 613]], dtype=object)
WestCoastProjects
Gioelelm
10

Simplifying the answer of Vincent J and considering the comment of HS-nebula, one can use return_index=True instead of return_counts=True and get rid of the cumsum:

np.split(a[:,1], np.unique(a[:,0], return_index = True)[1])[1:]

Output

[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
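If the first column is not already sorted (see the comments below), a minimal sketch, assuming the same array a, is to sort by the first column first and then apply the one-liner:

a_sorted = a[a[:, 0].argsort()]   # sort rows by the first column
np.split(a_sorted[:, 1], np.unique(a_sorted[:, 0], return_index=True)[1])[1:]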
ns63sr
    What if the first column isn’t sorted? Can we somehow combine sorting it and creating the groups? – Vidak Jul 15 '19 at 06:29
  • @Vidak `a.sort(axis=0)` would sort the array inplace by its first column (assuming that indices are stored there) – ns63sr Sep 05 '19 at 19:47
  • @ns63sr What is `idx`? – jtlz2 Apr 06 '20 at 11:40
    This answer doesn't produce the correct output. It works if you set `idx = a[:,0]` so that the full code is `np.split(a[:,1], np.unique(a[:,0], return_index = True)[1])[1:]` – m13op22 Jul 02 '20 at 14:11
  • Great solution, but it has a limitation. This doesn't work if there is a missing index, (let's say 2). It will only return a list that is 3 items long, but you will then not be able to access by index into the new list as there would be some indices missing. Any way to return an empty list for the non-present indices? – jpmorr Apr 30 '21 at 12:06
2

It becomes pretty apparent that a = a[a[:, 0].argsort()] is the bottleneck of all the competitive grouping algorithms (big thanks to Vincent J for clarifying this). Over 80% of processing time is spent on this argsort call, and there is no easy way to replace or optimise it. The numba package allows speeding up a lot of algorithms and, hopefully, argsort will attract some of those efforts in the future. The remaining part of the grouping can be improved significantly, assuming the indices in the first column are small.

TL;DR

The remaining part of most grouping methods relies on np.unique, which is quite slow and excessive when the group values are small. It's more efficient to replace it with np.bincount, which can in turn be improved further with numba. Here are some variants showing how the remaining part can be improved:

def _custom_return(unique_id, a, split_idx, return_groups):
    '''Choose if you want to also return unique ids'''
    if return_groups:
        return unique_id, np.split(a[:,1], split_idx)
    else: 
        return np.split(a[:,1], split_idx)

def numpy_groupby_index(a, return_groups=False):
    '''Code refactor of method of Vincent J'''
    u, idx = np.unique(a[:,0], return_index=True) 
    return _custom_return(u, a, idx[1:], return_groups)

def numpy_groupby_counts(a, return_groups=False):
    '''Use cumsum of counts instead of index'''
    u, counts = np.unique(a[:,0], return_counts=True)
    idx = np.cumsum(counts)
    return _custom_return(u, a, idx[:-1], return_groups)

def numpy_groupby_diff(a, return_groups=False):
    '''No use of any np.unique options'''
    u = np.unique(a[:,0])
    idx = np.flatnonzero(np.diff(a[:,0])) + 1
    return _custom_return(u, a, idx, return_groups)

def numpy_groupby_bins(a, return_groups=False):  
    '''Replace np.unique by np.bincount'''
    bins = np.bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

def numba_groupby_bins(a, return_groups=False):  
    '''Replace np.bincount by numba_bincount'''
    bins = numba_bincount(a[:,0])
    nonzero_bins_idx = bins != 0
    nonzero_bins = bins[nonzero_bins_idx]
    idx = np.cumsum(nonzero_bins[:-1])
    return _custom_return(np.flatnonzero(nonzero_bins_idx), a, idx, return_groups)

So numba_bincount works in the same way as np.bincount and it's defined like so:

from numba import njit

@njit
def _numba_bincount(a, counts, m):
    for i in range(m):
        counts[a[i]] += 1

def numba_bincount(arr): #just a refactor of Python count
    M = np.max(arr)
    counts = np.zeros(M + 1, dtype=int)
    _numba_bincount(arr, counts, len(arr))
    return counts
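As a quick sanity check (my addition), numba_bincount and np.bincount should agree on non-negative integer input:

arr = np.array([1, 1, 2, 3, 3, 3])
assert np.array_equal(numba_bincount(arr), np.bincount(arr))
# both give array([0, 2, 1, 3])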

Usage:

a = np.array([[1,275],[1,441],[1,494],[1,593],[2,679],[2,533],[2,686],[3,559],[3,219],[3,455],[4,605],[4,468],[4,692],[4,613]])
a = a[a[:, 0].argsort()]
>>> numpy_groupby_index(a, return_groups=False)
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
>>> numpy_groupby_index(a, return_groups=True)
(array([1, 2, 3, 4]),
 [array([275, 441, 494, 593]),
  array([679, 533, 686]),
  array([559, 219, 455]),
  array([605, 468, 692, 613])])

Performance tests

It takes ~30 seconds to sort 100M items on my computer (with 10 distinct IDs). Let's test how much time the methods for the remaining part take to run:

%matplotlib inline
benchit.setparams(rep=3)

sizes = [3*10**(i//2) if i%2 else 10**(i//2) for i in range(17)]
N = sizes[-1]
x1 = np.random.randint(0,10, size=N)
x2 = np.random.normal(loc=500, scale=200, size=N).astype(int)
a = np.transpose([x1, x2])

arr = a[a[:, 0].argsort()]
fns = [numpy_groupby_index, numpy_groupby_counts, numpy_groupby_diff, numpy_groupby_bins, numba_groupby_bins]
in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of events')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

[Benchmark plot: runtime of each grouping function vs. millions of events]

No doubt the numba-powered bincount is the new winner for datasets that contain small IDs. It speeds up the grouping of the already sorted data by ~5x, which is ~10% of the total runtime.

mathfux
1

I used np.unique() followed by np.extract()

unique = np.unique(a[:, 0:1])
answer = []
for element in unique:
    present = a[:,0]==element
    answer.append(np.extract(present,a[:,-1]))
print (answer)

[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

user2251346
1

Given X as the array of items you want grouped and y (a 1D array) as the corresponding group labels, the following function does the grouping with numpy:

def groupby(X, y):
    y = np.asarray(y)
    X = np.asarray(X)
    y_uniques = np.unique(y)
    return [X[y==yi] for yi in y_uniques]

So, groupby(a[:,1], a[:,0]) returns [array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]

afruzan
1

We might also find it useful to generate a dict:

def groupby(X): 
    X = np.asarray(X) 
    x_uniques = np.unique(X) 
    return {xi:X[X==xi] for xi in x_uniques} 

Let's try it out:

X=[1,1,2,2,3,3,3,3,4,5,6,7,7,8,9,9,1,1,1]
groupby(X)                                                                                                      
Out[9]: 
{1: array([1, 1, 1, 1, 1]),
 2: array([2, 2]),
 3: array([3, 3, 3, 3]),
 4: array([4]),
 5: array([5]),
 6: array([6]),
 7: array([7, 7]),
 8: array([8]),
 9: array([9, 9])}

Note that this by itself is not super compelling - but if we make X an object or namedtuple and then provide a groupby function, it becomes more interesting. Will put that in later.
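As an illustration of that idea (my own sketch, not the promised follow-up; the Event namedtuple and groupby_attr helper are made up for the example), grouping namedtuples by one field with a plain dict could look like:

from collections import defaultdict, namedtuple

Event = namedtuple("Event", ["key", "val"])
events = [Event(1, 275), Event(1, 441), Event(2, 679), Event(2, 533)]

def groupby_attr(items, attr):
    """Group items into a dict keyed by the given attribute name."""
    out = defaultdict(list)
    for item in items:
        out[getattr(item, attr)].append(item)
    return dict(out)

groupby_attr(events, "key")
# {1: [Event(key=1, val=275), Event(key=1, val=441)],
#  2: [Event(key=2, val=679), Event(key=2, val=533)]}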

WestCoastProjects
    When you work with numpy, going back to python dicts is in general a huge decrease in speed. If you work with larger arrays, stick with numpy functionality. – Michael Jan 30 '21 at 11:06
  • Sure - but often enough tasks are "small data" . If the tasks are bigger data than @vincentj 's answer - which i already upvoted and commented - works better. But it's not exactly tip of the tongue – WestCoastProjects Jan 30 '21 at 21:58
1

Late to the party, but anyway: if you not only plan to group the arrays but also want to perform operations on them like sum, mean and so on, and you're doing this with speed in mind, you might want to consider numpy_groupies. All those group operations are optimized and jitted with numba, and they easily outperform the other solutions mentioned.

from numpy_groupies.aggregate_numpy import aggregate
aggregate(a[:,0], a[:,1], "array", fill_value=[])
>>> array([array([], dtype=int64), array([275, 441, 494, 593]),
           array([679, 533, 686]), array([559, 219, 455]),
           array([605, 468, 692, 613])], dtype=object)
aggregate(a[:,0], a[:,1], "sum")
>>> array([   0, 1803, 1898, 1233, 2378])
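The same call pattern works for the other jitted reductions mentioned above; for instance, a per-group mean (sketch, assuming the same a):

aggregate(a[:, 0], a[:, 1], "mean")
# array of per-group means indexed by group id (id 0 does not occur in a)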

Michael
1

Another approach suggested by Ashwini Chaudhary may be what you are looking for. Putting it in a simple function:

def np_groupby(x, index):
    return np.split(x, np.where(np.diff(x[:,index]))[0]+1)

Here x is the numpy array and index is the column index to group by.

Regarding the [0] + 1, quoting Ashwini: numpy.diff is used to find out where the items actually change; anything non-zero means the item next to it was different, so numpy.where finds the indices of the non-zero items, and 1 is added because the actual index of such an item is one more than the returned index.
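A quick usage sketch (my addition, assuming the sorted array a from the question); note that each group keeps both columns, so take the second column if only the values are wanted:

groups = np_groupby(a, 0)
values = [g[:, 1] for g in groups]
# [array([275, 441, 494, 593]), array([679, 533, 686]),
#  array([559, 219, 455]), array([605, 468, 692, 613])]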

GSA