Pandas Group By With Overlapping Bins

Question

I want to sum up data across overlapping bins. Basically the question here but instead of the bins being (0-8 years old), (9 - 17 years old), (18-26 years old), (27-35 years old), and (26 - 44 years old) I want them to be (0-8 years old), (1 - 9 years old), (2-10 years old), (3-11 years old), and (4 - 12 years old).

Starting with a df like this

id	awards	age
1	100	24
1	150	26
1	50	54
2	193	34
2	209	50

I am using the code from this answer to calculate summation across non-overlapping bins.

bins = [9 * i for i in range(0, df['age'].max() // 9 + 2)]
cuts = pd.cut(df['age'], bins, right=False)

print(cuts)

0    [18, 27)
1    [18, 27)
2    [54, 63)
3    [27, 36)
4    [45, 54)
Name: age, dtype: category
Categories (7, interval[int64, left]): [[0, 9) < [9, 18) < [18, 27) < [27, 36) < [36, 45) < [45, 54) < [54, 63)]

df_out = (df.groupby(['id', cuts])
            .agg(total_awards=('awards', 'sum'))
            .reset_index(level=0)
            .reset_index(drop=True)
         )
df_out['age_interval'] = df_out.groupby('id').cumcount()

Result

print(df_out)

    id  total_awards  age_interval
0    1             0             0
1    1             0             1
2    1           250             2
3    1             0             3
4    1             0             4
5    1             0             5
6    1            50             6
7    2             0             0
8    2             0             1
9    2             0             2
10   2           193             3
11   2             0             4
12   2           209             5
13   2             0             6

Is it possible to work off the existing code to do this with overlapping bins?

please mention your desire output and add working code snippet so that we can reproduce the issue — Sabil, Sep 17 '21 at 18:02

Cimbali · Answer 1 · 2021-09-20T23:11:06.563

I think the best would be to first compute per-age sums, and then a rolling window to get all 9 year intervals. This only works because all your intervals have the same size − otherwise it would be much harder.

>>> totals = df.groupby('age')['awards'].sum()
>>> totals = totals.reindex(np.arange(0, df['age'].max() + 9)).fillna(0, downcast='infer')
>>> totals
0      6
1      2
2      4
3      6
4      4
      ..
98     0
99     0
100    0
101    0
102    0
Name: age, Length: 103, dtype: int64
>>> totals.rolling(9).sum().dropna().astype(int).rename(lambda age: f'{age-8}-{age}')
0-8       42
1-9       43
2-10      45
3-11      47
4-12      47
          ..
90-98     31
91-99     27
92-100    20
93-101    13
94-102     8
Name: age, Length: 95, dtype: int64

This is slightly complicated by the fact you also want to group by id, but the idea stays the same:

>>> idx = pd.MultiIndex.from_product([df['id'].unique(), np.arange(0, df['age'].max() + 9)], names=['id', 'age'])
>>> totals = df.groupby(['id', 'age']).sum().reindex(idx).fillna(0, downcast='infer')
>>> totals
       awards
1 0       128
  1       204
  2       136
  3       367
  4       387
...       ...
2 98        0
  99        0
  100       0
  101       0
  102       0

[206 rows x 1 columns]
>>> totals.groupby('id').rolling(9).sum().droplevel(0).dropna().astype(int).reset_index('id')
     id  awards
age            
8     1    3112
9     1    3390
10    1    3431
11    1    3609
12    1    3820
..   ..     ...
98    2    1786
99    2    1226
100   2     900
101   2     561
102   2     317

[190 rows x 2 columns]

This is the same as @Ben.T’s answer except we keep the Series shape and his answer pivots it to a dataframe. At any step you could .stack('age') or .unstack('age') to switch between both answer’s formats.

score 2 · Accepted Answer · answered Sep 20 '21 at 15:24

First pivot_table your data to get a row per id and the columns being the ages. then reindex to get all the ages possible, from 0 to at least the max in the column age (here I use the max plus the interval length). Now you can use rolling along the columns. Rename the columns to create meaningful names. Finally stack and reset_index to get a dataframe with the expected shape.

interval = 9 #include both bounds like 0 and 8 for the first interval
res = (
    df.pivot_table(index='id', columns='age', values='awards', 
                   aggfunc=sum, fill_value=0)
      .reindex(columns=range(0, df['age'].max()+interval), fill_value=0)
      .rolling(interval, axis=1, min_periods=interval).sum()
      .rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)

and you get with the input data provided in the question

print(res)
#      id         age  awards
# 0     1    0-8 y.o.     0.0
# 1     1    1-9 y.o.     0.0
# ...
# 15    1  15-23 y.o.     0.0
# 16    1  16-24 y.o.   100.0
# 17    1  17-25 y.o.   100.0
# 18    1  18-26 y.o.   250.0
# 19    1  19-27 y.o.   250.0
# 20    1  20-28 y.o.   250.0
# 21    1  21-29 y.o.   250.0
# 22    1  22-30 y.o.   250.0
# 23    1  23-31 y.o.   250.0
# 24    1  24-32 y.o.   250.0
# 25    1  25-33 y.o.   150.0
# 26    1  26-34 y.o.   150.0
# 27    1  27-35 y.o.     0.0
# ...
# 45    1  45-53 y.o.     0.0
# 46    1  46-54 y.o.    50.0
# 47    1  47-55 y.o.    50.0
# 48    1  48-56 y.o.    50.0
# 49    1  49-57 y.o.    50.0
# ...

Alonso Ogueda Oliva · Answer 3 · 2021-09-20T15:47:03.457

1

A old way without pd.cut using a for loop and some masks.

import pandas as pd
max_age = df["age"].max()
interval_length = 8
values = []
for min_age in range(max_age - interval_length + 1):
    max_age = min_age + interval_length
    awards = df.query("@min_age <= age <= @max_age").loc[:, "age"].sum()
    values.append([min_age, max_age, awards])

df_out = pd.DataFrame(values, columns=["min_age", "max_age", "awards"])

Let me know if this is what you want :)

edited Sep 20 '21 at 15:47

answered Sep 19 '21 at 23:41

Alonso Ogueda Oliva

191
6

Thanks! I'm hoping to avoid for loops to make it faster! – user1114 Sep 20 '21 at 00:15
Got it, that makes sense to me too. If you have a lot of ages you could do some parallelization on this code, since each step is independent of the others. – Alonso Ogueda Oliva Sep 20 '21 at 15:48

score 1 · Answer 4 · answered Sep 20 '21 at 01:28

IIUC, you can use pd.IntervalIndex with some list comprehension:

ii = pd.IntervalIndex.from_tuples(
    [
        (s, e)
        for e, s in pd.Series(np.arange(51)).rolling(9).agg(min).dropna().iteritems()
    ]
)

df_out = pd.concat(
    [
        pd.Series(ii.contains(x["age"]) * x["awards"], index=ii)
        for i, x in df[["age", "awards"]].iterrows()
    ],
    axis=1,
).groupby(level=0).sum().T

df_out.stack()

Output:

0  (0.0, 8.0]        0
   (1.0, 9.0]        0
   (2.0, 10.0]       0
   (3.0, 11.0]       0
   (4.0, 12.0]       0
                  ... 
4  (38.0, 46.0]      0
   (39.0, 47.0]      0
   (40.0, 48.0]      0
   (41.0, 49.0]      0
   (42.0, 50.0]    209
Length: 215, dtype: int64

score 0 · Answer 5 · answered Sep 23 '21 at 19:46

Let df be a DataFrame:

import pandas as pd
import random

def r(b, e):
    return [random.randint(b, e) for _ in range(300)]

df = pd.DataFrame({'id': r(1, 3), 'awards': r(0, 400), 'age': r(1, 99)})

For binning by age, I would advise creating a new column since it is clearer (and faster):

df['bin'] = df['age'].apply(lambda x: x // 9)
print(df)

The number of awards per id per bin can be obtained using simply:

totals_separate = df.groupby(['id', 'bin'])['awards'].sum()
print(totals_separate)

If I understand correctly, you would like the sum for each window of size 9 rows:

totals_rolling = df.groupby(['id', 'bin'])['awards'].rolling(9, min_periods=1).sum()
print(totals_rolling)

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html

Pandas Group By With Overlapping Bins

5 Answers5