Pandas 're-binning' a DataFrame

Question

I have a DataFrame containing data with age binned in separate rows, as below:

VALUE,AGE
10,   0-4
20,   5-9
30, 10-14
40, 15-19
..  .....

So, basically, the age is grouped in 5-year bins. I'd like to have 10-year bins instead, that is, 0-9, 10-19, etc. What I'm after is the VALUE summed over each 10-year age bin, so the result would be:

VALUE,AGE
30,  0-9
70, 10-19

I can do it by shifting and adding, and taking every second row of the resulting dataframe, but is there a smarter, more general way built into Pandas to do this?

Are the bins strings? Or actual intervals? Could any of the new bins be ambiguous? For example, would you ever want to bin based on 7-year bins? — user3483203, Jun 26 '20 at 20:10
The bins are strings. The reason I want to rebin is that I have another df, coming from a different source, that has the ages grouped into 10 year bins, while the first df (also pulled down from the net) uses 5 year bins. I want to compare the age based data (the 'value' of the example) with some corresponding value from the other dataframe, so the bins must be identical. In this particular case I'm lucky, since the second df's age grouping is a multiple of the first, so I can pull it off by shift(), but in general, like you suggest, how would one do it if the binnings were not multiples...? — Pryderide, Jun 26 '20 at 20:19

Tom · Answer 1 · 2020-06-26T20:58:49.543

Here's a "dumb" version, based on this answer - just sum every 2 rows:

In[0]
df.groupby(df.index // 2).sum()

Out[0]:
   VALUE
0     30
1     70

I say "dumb" because this method doesn't factor in the age cutoffs; it just happens to align with them. If the age ranges vary in width, or if your data start at 5-9 instead of 0-4, the pairs of rows will no longer line up with the 10-year bins. You also have to rename the index, since 0 and 1 don't say which ages they cover.
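For the index renaming mentioned above, here is a minimal sketch of one way to do it. It assumes the rows are sorted by age and pair up cleanly; the names `pair`, `lo` and `hi` are just placeholders I made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})

# Pair up consecutive rows: 0, 0, 1, 1, ...
pair = df.index // 2

# Sum VALUE and remember the first and last AGE string in each pair
out = df.groupby(pair).agg(VALUE=('VALUE', 'sum'),
                           lo=('AGE', 'first'),
                           hi=('AGE', 'last'))

# New label = lower bound of the first bin + upper bound of the last bin
out['AGE'] = out['lo'].str.split('-').str[0] + '-' + out['hi'].str.split('-').str[-1]
out = out[['VALUE', 'AGE']]
print(out)
```

This still relies on row position rather than the actual ages, so it has the same alignment caveat as the one-liner.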

A "smarter" version would be to actually create bins with pd.cut and use that to group the data, based on the ages for each row:

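In[0]
# Upper age of each row, e.g. '5-9' -> 9
df['MAX_AGE'] = df['AGE'].str.split('-').str[-1].astype(int)

bins = [0,10,20]
# right=False makes the bins left-closed: [0, 10) and [10, 20)
# Selecting [['VALUE']] keeps the string AGE column and MAX_AGE out of the sum
out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False), observed=False)[['VALUE']].sum()

Out[0]:
          VALUE
MAX_AGE        
[0, 10)      30
[10, 20)     70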

Explanation:

Use pandas.Series.str methods to extract the maximum age for each row, and store it in a column "MAX_AGE".
Create bins at 10-year cutoffs.
Use pd.cut to assign the rows to bins based on the max age of each row, then use groupby on these bins and sum. Note that since we specify right=False, each bin includes its left edge and excludes its right edge, so the bins in the index correspond to ages 0-9 and 10-19.
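If you want the result labelled with strings like the original AGE column (handy for comparing against the other DataFrame), the steps above can be sketched with the `labels` argument of pd.cut. This is only a sketch: it assumes every 5-year bin falls entirely inside one target bin, and the `labels` list and `age10` name are my own additions:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})

# Upper age of each row, e.g. '5-9' -> 9
max_age = df['AGE'].str.split('-').str[-1].astype(int)

# Left-closed 10-year bins: [0, 10), [10, 20)
bins = [0, 10, 20]

# Build '0-9', '10-19' style labels from consecutive bin edges
labels = [f'{lo}-{hi - 1}' for lo, hi in zip(bins[:-1], bins[1:])]

age10 = pd.cut(max_age, bins=bins, right=False, labels=labels)

# Sum VALUE per new bin and turn the bin labels back into an AGE column
out = (df.groupby(age10, observed=False)['VALUE']
         .sum()
         .rename_axis('AGE')
         .reset_index())
print(out)
```

Changing `bins` to other edges (say 0, 20, 40) should work the same way, as long as the source bins nest inside the new ones.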

For reference, here is the data I was using:

import pandas as pd
    
VALUE = [10,20,30,40,]
AGE = ['0-4','5-9','10-14','15-19']

df = pd.DataFrame({'VALUE':VALUE,
                   'AGE':AGE})

I like the cut-based solution; it was something similar I thought might do the trick, but failed to figure out. And I like your very clear and detailed commentary. — Pryderide, Jun 27 '20 at 10:56
I don't really care about the precision, and I need to do something similar on a lot of data. Is `df.groupby(df.index // 2).sum()` fast? Testing it now. — Czarking, Jun 18 '21 at 17:48

score 0 · Answer 2 · answered Jun 26 '20 at 20:49

This should work as long as the bins are all in 5-year increments. It finds the rows where the upper bound is odd (9, 19, ...) and groups each of them with the row that came before, stopping at the last odd upper bound.

The code below splits the AGE string to get the numerical lower and upper bounds:

df['lower'] = df['AGE'].str.split('-').str[0]
df['upper'] = df['AGE'].str.split('-').str[1]
df[['lower','upper']] = df[['lower','upper']].astype(int)

Then apply the grouping logic, and rebuild the AGE label so that it represents the new 10-year range.

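# Start a new group after each row whose upper bound is odd (9, 19, ...)
group = (df['upper'] % 2 == 1).shift(fill_value=False).cumsum()
df['VALUE'] = df.groupby(group)['VALUE'].transform('sum')
# Keep the last row of each group; deduplicating on the group key rather than
# on VALUE avoids dropping a group that happens to share its sum with another
df = df[~group.duplicated(keep='last')].copy()
df['lower'] = df['lower'] - 5
df['AGE'] = df['lower'].astype(str) + '-' + df['upper'].astype(str)
df = df.drop(columns=['lower', 'upper'])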
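To see what the grouping key looks like, here is a small sketch that prints the intermediate steps on the sample data from the question. The names `is_odd` and `key` are mine, and I use shift(fill_value=False) here just to keep the column boolean:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})

# Upper bound of each bin as an integer: 4, 9, 14, 19
upper = df['AGE'].str.split('-').str[1].astype(int)

# True where a 10-year bin ends (upper bound is odd)
is_odd = upper % 2 == 1

# Shift down one row so the counter increases on the row AFTER each odd bound
key = is_odd.shift(fill_value=False).cumsum()

print(pd.DataFrame({'AGE': df['AGE'], 'upper': upper, 'odd': is_odd, 'key': key}))
```

Rows that share a `key` value are the ones that get summed together.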
