I have created different bins for each column and grouped the DataFrame based on these.

import pandas as pd
import numpy as np

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'value'])

# for simplicity, I use the same bins for every column
bins = np.arange(-3, 4, 0.05)

df['a_bins'] = pd.cut(df['a'], bins=bins)
df['b_bins'] = pd.cut(df['b'], bins=bins)
df['c_bins'] = pd.cut(df['c'], bins=bins)

The output of df.groupby(['a_bins','b_bins','c_bins']).size() shows 2,685,619 groups, i.e. the full Cartesian product of the 139 intervals per column (139³ = 2,685,619), most of which are empty.
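
A quick sanity check of that number, using the bins defined above:

n_intervals = len(bins) - 1  # 140 edges -> 139 intervals per column
print(n_intervals ** 3)      # 2685619: one group per bin combination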

Calculate statistics of each group

Then, the statistics of each group are calculated like this:

%%timeit
df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']})

>>> 16.9 s ± 637 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Expected output

  1. Is it possible to speed this up?
  2. The quicker method should also support looking up the value for given a, b, and c inputs, like this:
df.groupby(['a_bins','b_bins','c_bins']).agg({'value':['mean']}).loc[(-1.72, 0.32, 1.18)]

>>> -0.252436
zxdawn
  • Kindly create a sample dataframe with expected output, so we are sure our results match and we are on the right track – sammywemmy Dec 22 '21 at 21:40
  • @sammywemmy Thanks for the suggestion. `np.random.seed()` makes sure we have the same DataFrame. I have updated the expected outputs now. – zxdawn Dec 22 '21 at 21:56

4 Answers

For this data, I'd suggest you pivot the data and compute the mean. Usually this is faster, since the mean is computed on the entire dataframe at once instead of group by group:

(df
 .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index()  # skip this if you are not fussy about order
)

a_bins         b_bins         c_bins       
(-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
               (0.75, 0.8]    (-0.3, -0.25]    0.480411
(-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
               (0.75, 0.8]    (-0.25, -0.2]   -1.184411
(-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                 ...   
(1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
(1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
(1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
(2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
(2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432

This matches the output from the groupby:

(df
 .groupby(['a_bins','b_bins','c_bins'])
 .agg({'value':['mean']})
 .dropna()
 .squeeze()
)

a_bins         b_bins         c_bins       
(-2.15, -2.1]  (0.25, 0.3]    (-1.3, -1.25]    0.929100
               (0.75, 0.8]    (-0.3, -0.25]    0.480411
(-2.05, -2.0]  (-0.1, -0.05]  (0.3, 0.35]     -1.684900
               (0.75, 0.8]    (-0.25, -0.2]   -1.184411
(-2.0, -1.95]  (-0.6, -0.55]  (-1.2, -1.15]   -0.021176
                                                 ...   
(1.7, 1.75]    (-0.75, -0.7]  (1.05, 1.1]     -0.229518
(1.85, 1.9]    (-0.4, -0.35]  (1.8, 1.85]      0.003017
(1.9, 1.95]    (-1.45, -1.4]  (0.1, 0.15]      0.949361
(2.05, 2.1]    (-0.35, -0.3]  (-0.65, -0.6]    0.763184
(2.25, 2.3]    (-0.95, -0.9]  (0.1, 0.15]      2.539432
Name: (value, mean), Length: 100, dtype: float64

The pivot option runs in about 3.72 ms on my PC, while I had to terminate the groupby option, as it was taking too long (my PC is quite old :))

Again, this works/is faster because the mean is computed over the entire dataframe at once, instead of going through each group in the groupby.
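
To make that concrete, a small sketch (the shape comment assumes this 100-row frame): the pivoted frame has one column per observed bin combination, so a single columnar .mean() produces every group mean at once:

wide = df.pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
wide.shape   # (100, number of observed combinations): one column per group
wide.mean()  # a single vectorized reduction over all columns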

As to your other question, you can index it easily:


bin_mean = (df
 .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index()  # skip this if you are not fussy about order
)

bin_mean.loc[(-1.72, 0.32, 1.18)]
 -0.25243603652138985

The main problem, though, is that for categoricals pandas returns rows for all possible category combinations (which is wasteful and inefficient); pass observed=True and you should notice a dramatic improvement:

(df.groupby(['a_bins','b_bins','c_bins'], observed=True)
   .agg({'value':['mean']})
)

                                              value
                                               mean
a_bins        b_bins        c_bins                 
(-2.15, -2.1] (0.25, 0.3]   (-1.3, -1.25]  0.929100
              (0.75, 0.8]   (-0.3, -0.25]  0.480411
(-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35]   -1.684900
              (0.75, 0.8]   (-0.25, -0.2] -1.184411
(-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
...                                             ...
(1.7, 1.75]   (-0.75, -0.7] (1.05, 1.1]   -0.229518
(1.85, 1.9]   (-0.4, -0.35] (1.8, 1.85]    0.003017
(1.9, 1.95]   (-1.45, -1.4] (0.1, 0.15]    0.949361
(2.05, 2.1]   (-0.35, -0.3] (-0.65, -0.6]  0.763184
(2.25, 2.3]   (-0.95, -0.9] (0.1, 0.15]    2.539432

Speed is about 7.39 ms on my PC, roughly twice the pivot option's time, but way faster than the original groupby, and that's because only category combinations that actually exist in the dataframe are used/returned.
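
For completeness, this result supports the same interval lookup as the question's example (a small sketch; the scalar inputs resolve to the intervals that contain them):

res = (df.groupby(['a_bins', 'b_bins', 'c_bins'], observed=True)
         .agg({'value': ['mean']}))
res.loc[(-1.72, 0.32, 1.18)]  # -> -0.252436, the question's expected output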

sammywemmy
  • Nice example of `pivot`! But when I increase the data to 100000 rows, it raises `Unable to allocate 65.2 GiB for an array with shape (100000, 87554) and data type float64`. @NikitaAlmakov's method still works well. – zxdawn Dec 23 '21 at 09:01
  • Hmmmm…. That is a lot of memory. Same issue when you run the groupby with observed = True? @NikitaAlmakov’s library is awesome. – sammywemmy Dec 23 '21 at 09:59
  • Ha, I only tested `pivot`. `observed=True` works well and it is faster than @NikitaAlmakov's method! convtools: `777 ms ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`, observed=True: `32.1 ms ± 592 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)`. – zxdawn Dec 23 '21 at 10:07
  • Note that for data of length `1000`, they're similar. `observed=True`: `7.06 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)` and `convtools`: `7.32 ms ± 667 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`. It's hard to pick which one is the answer, hah. Maybe you can make a comparison figure of both methods for different data lengths? Then I'd have no doubt about accepting your answer. – zxdawn Dec 23 '21 at 10:11
  • @XinZhang this was a pandas-related question, sammywemmy answered it, while I just shared an alternative option, which may be helpful if some stream processing is needed and input data doesn't fit into memory. I'd vote for accepting sammy's one :) – westandskif Dec 23 '21 at 10:14
  • Throw a coin in the air. :) The thrill for me is the learning opportunities… points are a by-product. Whichever you pick is fine. – sammywemmy Dec 23 '21 at 10:19
  • @NikitaAlmakov Got it! BTW, @sammywemmy could you check the time info in your answer? It seems `7.39ms` is for `pivot` and `3.72ms` for `observed=True`. – zxdawn Dec 23 '21 at 10:20
  • @XinZhang, just tested again, pivot(3.77ms) is faster than the groupby(7.14ms) – sammywemmy Dec 23 '21 at 11:05

An alternative, straightforward solution is based on convtools, which can process an input stream of data and doesn't require the input to fit into memory:

import numpy as np
import pandas as pd

from convtools import conversion as c


def c_bin(left, right, bin_size):
    return c.if_(
        c.or_(c.this < left, c.this > right),
        None,
        ((c.this - left) // bin_size).pipe(
            (c.this * bin_size + left, (c.this + 1) * bin_size + left)
        ),
    )


to_binned = c_bin(-3, 4, 0.05)
to_interval = c.if_(c.this, c.apply_func(pd.Interval, c.this, {}), None)

a_bins = c.item(0).pipe(to_binned)
b_bins = c.item(1).pipe(to_binned)
c_bins = c.item(2).pipe(to_binned)
converter = (
    c.group_by(a_bins, b_bins, c_bins)
    .aggregate(
        {
            "a_bins": a_bins.pipe(to_interval),
            "b_bins": b_bins.pipe(to_interval),
            "c_bins": c_bins.pipe(to_interval),
            "value_mean": c.ReduceFuncs.Average(c.item(3)),
        }
    )
    .gen_converter()
)


np.random.seed(100)
data = np.random.randn(100, 4)

df = pd.DataFrame(converter(data)).set_index(["a_bins", "b_bins", "c_bins"])
df.loc[(-1.72, 0.32, 1.18)]

Timings:

In [44]: %timeit converter(data)
438 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# passing back to pandas, timing the end-to-end thing:
In [43]: %timeit pd.DataFrame(converter(data)).set_index(["a_bins", "b_bins", "c_bins"]).loc[(-1.72, 0.32, 1.18)]
2.37 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

JFYI: Shortened output of converter(data):

[
 ...,
 {'a_bins': Interval(-0.44999999999999973, -0.3999999999999999, closed='right'),
  'b_bins': Interval(0.7000000000000002, 0.75, closed='right'),
  'c_bins': Interval(-0.19999999999999973, -0.1499999999999999, closed='right'),
  'value_mean': -0.08605564337254189},
 {'a_bins': Interval(-0.34999999999999964, -0.2999999999999998, closed='right'),
  'b_bins': Interval(-0.1499999999999999, -0.09999999999999964, closed='right'),
  'c_bins': Interval(0.050000000000000266, 0.10000000000000009, closed='right'),
  'value_mean': 0.18971879197958597},
 {'a_bins': Interval(-2.05, -2.0, closed='right'),
  'b_bins': Interval(0.75, 0.8000000000000003, closed='right'),
  'c_bins': Interval(-0.25, -0.19999999999999973, closed='right'),
  'value_mean': -1.1844114274105708}]
westandskif
  • Thanks for the amazing tool. It's really fast! Is it possible to meet the second requirement of `Expected output`? BTW, could you explain why this is much faster than @sammywemmy's method? – zxdawn Dec 23 '21 at 08:55
  • @XinZhang Hope this will be a decent complement to polars/pandas in your toolkit! :) I've updated the above to meet the 2nd requirement (missed this part, sorry for that). Regarding speed, it's nowhere near as fast as polars/pandas, which operate at a lower level & use vectorization. However, convtools is built to generate simple, fast, raw Python ad hoc code via dynamic code generation, and sometimes that helps :) It can also improve code reuse a lot! – westandskif Dec 23 '21 at 09:41
  • Awesome library. Going into my toolkit :) – sammywemmy Dec 23 '21 at 09:53
  • @Nikita Almakov Thanks! It's awesome ;) – zxdawn Dec 23 '21 at 09:54

This is a good use-case for scipy.stats.binned_statistic_dd. The snippet below computes the mean statistic only, but many other statistics are supported (see the scipy docs):

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 4), columns=["a", "b", "c", "value"])

# for simplicity, I use the same bin here
bins = np.arange(-3, 4, 0.05)

df["a_bins"] = pd.cut(df["a"], bins=bins)
df["b_bins"] = pd.cut(df["b"], bins=bins)
df["c_bins"] = pd.cut(df["c"], bins=bins)

# this takes about 35 seconds
result_pandas = df.groupby(["a_bins", "b_bins", "c_bins"]).agg({"value": ["mean"]})

from scipy.stats import binned_statistic_dd

# this takes about 20 ms
result_scipy = binned_statistic_dd(
    df[["a", "b", "c"]].to_numpy(), df["value"], bins=(bins, bins, bins)
)

# this is a verbose way to get a dataframe representation
# for many purposes this probably will not be needed
# takes about 5 seconds
temp_list = []
for na, a in enumerate(result_scipy[1][0][:-1]):
    for nb, b in enumerate(result_scipy[1][1][:-1]):
        for nc, c in enumerate(result_scipy[1][2][:-1]):
            value = result_scipy[0][na, nb, nc]
            temp_list.append([a, b, c, value])

result_scipy_as_df = pd.DataFrame(temp_list, columns=list("abcx"))

# check that the result is the same
result_scipy_as_df["x"].describe() == result_pandas["value"]["mean"].describe()
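
As an aside, the triple loop above can be replaced with a vectorized construction (a sketch assuming the same (a, b, c, x) layout; statistic.ravel() flattens in C order, which matches the loop nesting):

# Cartesian product of the left bin edges, then flatten the statistic array
edges_a, edges_b, edges_c = (e[:-1] for e in result_scipy.bin_edges)
index = pd.MultiIndex.from_product([edges_a, edges_b, edges_c], names=list("abc"))
result_scipy_as_df = pd.Series(
    result_scipy.statistic.ravel(), index=index, name="x"
).reset_index()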

If you are interested in speeding this up further, this answer might be useful.

An important caveat is that binned_statistic_dd uses bins that are closed on the left and open on the right, e.g. [0, 1), except for the last one (refer to the Notes in the scipy docs), whereas pd.cut defaults to right-closed intervals, so for consistent bin identifiers one would have to use right=False in pd.cut.
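
For example, to make pd.cut produce left-closed bins matching binned_statistic_dd:

df["a_bins"] = pd.cut(df["a"], bins=bins, right=False)  # [left, right) intervals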

Here's a look-up example; note that the bin edges are shifted by one ([1:]) to get a result consistent with pandas:

aloc, bloc, cloc = -2.12, 0.23, -1.25
print(result_pandas.loc[(aloc, bloc, cloc)])
print(result_scipy.statistic[
    np.digitize(aloc, result_scipy.bin_edges[0][1:]),
    np.digitize(bloc, result_scipy.bin_edges[1][1:]),
    np.digitize(cloc, result_scipy.bin_edges[2][1:]),
])
SultanOrazbayev
  • Oh, I realize that @sammywemmy's method will drop NaN values. If users need the NaN values, then your answer is quite useful! Thanks a lot ;) – zxdawn Dec 28 '21 at 09:18
  • Note that `np.digitize()` should add `right=True`, otherwise, the maximum value is out of the index. – zxdawn Dec 28 '21 at 20:45

Because your bins are the same for all 3 columns, use the codes from the cat accessor:

%timeit df.groupby([df['a_bins'].cat.codes, df['b_bins'].cat.codes, df['c_bins'].cat.codes])['value'].mean()
1.82 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
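
To also cover the question's lookup requirement, raw inputs can be mapped to the shared bin codes (a sketch; pd.cut with the same bins reproduces the category ordering, so the codes line up):

res = df.groupby([df['a_bins'].cat.codes,
                  df['b_bins'].cat.codes,
                  df['c_bins'].cat.codes])['value'].mean()

# map raw (a, b, c) inputs to their bin codes, then index the result
codes = pd.cut(pd.Series([-1.72, 0.32, 1.18]), bins=bins).cat.codes
res.loc[tuple(codes)]  # should give -0.252436, as in the question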
Corralien