17

Pandas has a widely used groupby facility to split up a DataFrame based on a corresponding mapping, from which you can apply a calculation on each subgroup and recombine the results.

Can this be done flexibly in NumPy without a native Python for-loop? With a Python loop, this would look like:

>>> import numpy as np

>>> X = np.arange(10).reshape(5, 2)
>>> groups = np.array([0, 0, 0, 1, 1])

# Split up elements (rows) of `X` based on their element-wise group
>>> np.array([X[groups==i].sum() for i in np.unique(groups)])
array([15, 30])

Above, 15 is the sum of the first three rows of X, and 30 is the sum of the remaining two.

By "flexibly,” I just mean that we aren't focusing on one particular computation such as sum, count, maximum, etc, but rather passing any computation to the grouped arrays.

If not, is there a faster approach than the above?

Brad Solomon
  • Note, `itertools.groupby` doesn't function like `pandas.DataFrame.groupby`, unless the iterable you pass to `itertools.groupby` is sorted by the grouping key (see the sketch after these comments)... – juanpa.arrivillaga Mar 07 '18 at 00:06
  • @juanpa.arrivillaga Right ... and the `key` parameter for itertools needs to be a callable (not an array of groups). And you couldn't use something like `groups.__getitem__` because it needs to be a function that gets applied to each element of the first arg – Brad Solomon Mar 07 '18 at 00:09
  • Do you have any idea of how pandas implements the groupby? Does it use Python structures (dictionary? itertools groupby?), C or Cython code? The itertools version groups contiguous runs, so the input (key) has to be sorted. – hpaulj Mar 07 '18 at 02:44
  • It's a bit like the binning problem here: [How to get a list of indexes selected by a specific value efficiently with numpy arrays?](https://stackoverflow.com/questions/48686381). I've argued that it is hard to get a true 'vectorized' solution because the bins (or groups) differ in size. Where you just want one index, @Divikar has given a `searchsorted` solution, [Retrieve indexes of multiple values with Numpy in a vectorization way](https://stackoverflow.com/questions/49067127) – hpaulj Mar 07 '18 at 02:51
  • @hpaulj the [python wrapper](https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L354) basically builds an iterator of (key, array) pairs. It does make [cython](https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby.pyx) calls for some of the actual computations – Brad Solomon Mar 07 '18 at 04:16
  • Is `groups` always an array of `uint`s? – Daniel F Mar 07 '18 at 06:41
  • Yes, always 1d, uint8 to be specific @DanielF (But not guaranteed to be sorted) – Brad Solomon Mar 07 '18 at 06:42
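
For reference, a minimal sketch of the sorted-key point raised in the comments above, reusing `X` and `groups` from the question:

>>> import itertools

# itertools.groupby only merges *consecutive* equal keys, so the row
# indices must first be ordered by group before it behaves like pandas.
>>> order = np.argsort(groups, kind='mergesort')  # stable sort by group
>>> for key, idx in itertools.groupby(order, key=lambda i: groups[i]):
...     print(key, X[list(idx)].sum())
0 15
1 30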

5 Answers

9

How about using a scipy sparse matrix?

import numpy as np
from scipy import sparse
import time

x_len = 500000
g_len = 100

X = np.arange(x_len * 2).reshape(x_len, 2)
groups = np.random.randint(0, g_len, x_len)

# original
s = time.time()

a = np.array([X[groups==i].sum() for i in np.unique(groups)])

print(time.time() - s)

# using scipy sparse matrix
s = time.time()

x_sum = X.sum(axis=1)
b = np.array(sparse.coo_matrix(
    (
        x_sum,
        (groups, np.arange(len(x_sum)))
    ),
    shape=(g_len, x_len)
).sum(axis=1)).ravel()

print(time.time() - s)

#compare
print(np.abs((a-b)).sum())

Result on my PC:

0.15915322303771973
0.012875080108642578
0

More than 10 times faster.
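
To see why this works, here is a minimal sketch on the toy data from the question (not part of the benchmark):

import numpy as np
from scipy import sparse

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])

# Row g of the sparse matrix collects the row sums of all rows of X
# labelled g, so summing along axis=1 reduces each group at once.
x_sum = X.sum(axis=1)                      # [1, 5, 9, 13, 17]
b = np.array(sparse.coo_matrix(
    (x_sum, (groups, np.arange(len(x_sum)))),
    shape=(groups.max() + 1, len(x_sum))
).sum(axis=1)).ravel()
print(b)                                   # [15 30]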


Update!

Let's benchmark the answers of @Paul Panzer and @Daniel F. It is a summation-only benchmark.

import numpy as np
from scipy import sparse
import time

# by @Daniel F
def groupby_np(X, groups, axis = 0, uf = np.add, out = None, minlength = 0, identity = None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype = X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out

x_len = 500000
g_len = 200

X = np.arange(x_len * 2).reshape(x_len, 2)
groups = np.random.randint(0, g_len, x_len)

print("original")
s = time.time()

a = np.array([X[groups==i].sum() for i in np.unique(groups)])

print(time.time() - s)

print("use scipy coo matrix")
s = time.time()

x_sum = X.sum(axis=1)
b = np.array(sparse.coo_matrix(
    (
        x_sum,
        (groups, np.arange(len(x_sum)))
    ),
    shape=(g_len, x_len)
).sum(axis=1)).ravel()

print(time.time() - s)

#compare
print(np.abs((a-b)).sum())


print("use scipy csr matrix @Daniel F")
s = time.time()
x_sum = X.sum(axis=1)
c = np.array(sparse.csr_matrix(
    (
        x_sum,
        groups,
        np.arange(len(groups)+1)
    ),
    shape=(len(groups), g_len)
).sum(axis=0)).ravel()

print(time.time() - s)

#compare
print(np.abs((a-c)).sum())


print("use bincount @Paul Panzer @Daniel F")
s = time.time()
d = np.bincount(groups, X.sum(axis=1), g_len)
print(time.time() - s)

#compare
print(np.abs((a-d)).sum())

print("use ufunc @Daniel F")
s = time.time()
e = groupby_np(X, groups)
print(time.time() - s)

#compare
print(np.abs((a-e)).sum())

STDOUT

original
0.2882847785949707
use scipy coo matrix
0.012301445007324219
0
use scipy csr matrix @Daniel F
0.01046299934387207
0
use bincount @Paul Panzer @Daniel F
0.007468223571777344
0.0
use ufunc @Daniel F
0.04431319236755371
0

The winner is the bincount solution. But the csr matrix solution is also very interesting.
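
On the question's toy data, the winning solution is a one-liner; `weights` carries the per-row sums:

import numpy as np

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])

# bincount adds weights[j] into slot groups[j]: a grouped sum in one
# pass through C code (the result dtype is float64).
print(np.bincount(groups, weights=X.sum(axis=1)))   # [15. 30.]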

klim
6

@klim's sparse matrix solution would at first sight appear to be tied to summation. We can, however, use it in the general case by converting between the csr and csc formats.

Let's look at a small example:

>>> m, n = 3, 8                                                                                                     
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>>                                                                                                                 
>>> M = sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m))                                                      
>>>                                                                                                                 
>>> idx                                                                                                             
array([0, 2, 2, 1, 1, 2, 2, 0])                                                                                     
>>> 
>>> M = M.tocsc()
>>> 
>>> M.indptr, M.indices
(array([0, 2, 4, 8], dtype=int32), array([0, 7, 3, 4, 1, 2, 5, 6], dtype=int32))

As we can see, after conversion the internal representation of the sparse matrix yields the indices grouped and sorted:

>>> groups = np.split(M.indices, M.indptr[1:-1])
>>> groups
[array([0, 7], dtype=int32), array([3, 4], dtype=int32), array([1, 2, 5, 6], dtype=int32)]
>>> 

We could have obtained the same using a stable argsort:

>>> np.argsort(idx, kind='mergesort')
array([0, 7, 3, 4, 1, 2, 5, 6])
>>> 

But sparse matrices are actually faster, even when we allow argsort to use a faster non-stable algorithm:

>>> m, n = 1000, 100000
>>> idx = np.random.randint(0, m, (n,))
>>> data = np.arange(n)
>>> 
>>> timeit('sparse.csr_matrix((data, idx, np.arange(n+1)), (n, m)).tocsc()', **kwds)
2.250748165184632
>>> timeit('np.argsort(idx)', **kwds)
5.783584725111723

If we require argsort to keep groups sorted, the difference is even larger:

>>> timeit('np.argsort(idx, kind="mergesort")', **kwds)
10.507467685034499
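
Putting it together, a minimal sketch of the general case on the question's data; np.median stands in for an arbitrary per-group computation, and the ones placed in the matrix are dummy values since only the indices matter:

import numpy as np
from scipy import sparse

X = np.arange(10).reshape(5, 2)
idx = np.array([0, 0, 0, 1, 1])
n, m = len(idx), idx.max() + 1

# Dummy data; we only want the grouped row indices that tocsc() builds.
M = sparse.csr_matrix((np.ones(n), idx, np.arange(n + 1)), (n, m)).tocsc()
grouped = np.split(M.indices, M.indptr[1:-1])

# Fancy index X with each group of row indices and apply any function.
print(np.array([np.median(X[g]) for g in grouped]))   # [2.5 7.5]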
Paul Panzer
  • This does constrain `data` to be 1d though, no? I.e. the trick with summing is that it reduced the input dimensionality from 2 to 1 in this case. Please let me know if I'm wrong on that. In the case of something like zero-axis mean on X, it seems you're stuck with 2d – Brad Solomon Mar 07 '18 at 13:00
  • @BradSolomon If you want to have `data` in the sparse matrix `M`, yes. But that is not strictly necessary. One can keep `data` separately (and put some dummy for `M`s values) and then fancy index them with `M.indices`. In that case `data` can have more dimensions. – Paul Panzer Mar 07 '18 at 13:26
5

If you want a more flexible implementation of groupby that can group using any of numpy's ufuncs:

def groupby_np(X, groups, axis = 0, uf = np.add, out = None, minlength = 0, identity = None):
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # fallback to loops over 0-index for identity
            assert np.all(np.in1d(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == i]) for i in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype = X.dtype)
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out

groupby_np(X, groups)
array([15, 30])

groupby_np(X, groups, uf = np.multiply)
array([   0, 3024])

groupby_np(X, groups, uf = np.maximum)
array([5, 9])

groupby_np(X, groups, uf = np.minimum)
array([0, 6])
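
For clarity, here is a trace of the two steps the function performs on the question's data (a sketch of the internals, not a separate API):

import numpy as np

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])

# Step 1: reduce away every axis except the grouped one.
row_red = np.add.reduce(X, axis=1)             # [1, 5, 9, 13, 17]

# Step 2: unbuffered scatter-add each reduced value into its group slot.
out = np.full(groups.max() + 1, np.add.identity, dtype=X.dtype)
np.add.at(out, groups, row_red)
print(out)                                      # [15 30]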
Daniel F
2

There's probably a faster way than this (both of the operands are making copies right now), but:

np.bincount(np.broadcast_to(groups, X.T.shape).ravel(), X.T.ravel())

array([ 15.,  30.])
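
The copies come from `ravel` having to linearize non-contiguous views, which a quick check confirms:

import numpy as np

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])

g2d = np.broadcast_to(groups, X.T.shape)    # zero-stride view, no copy yet
print(g2d.flags['WRITEABLE'])               # False: still a read-only view
print(g2d.ravel().flags['OWNDATA'])         # True: ravel had to copy
print(X.T.ravel().flags['OWNDATA'])         # True: the transpose copies too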
Daniel F
0

If you want to extend the answer to an ndarray and still have a fast computation, you could extend Daniel's solution:

x_len = 500000
g_len = 200
y_len = 2

X = np.arange(x_len * y_len).reshape(x_len, y_len)
groups = np.random.randint(0, g_len, x_len)

# original
a = np.array([X[groups==i].sum(axis=0) for i in np.unique(groups)])

# alternative: sort row indices by group, then sum each contiguous slice
bins = [0] + list(np.bincount(groups, minlength=g_len).cumsum())  # group boundaries
Z = np.argsort(groups)                                            # rows ordered by group
d = np.array([X.take(Z[bins[i]:bins[i+1]], 0).sum(axis=0) for i in range(g_len)])

It took about 30 ms (15 ms for creating the bins + 15 ms for summing) instead of about 280 ms for the original approach in this example.

>>> d.shape
(200, 2)
frenco