So I wrote a function that does the same thing as describe, except that it takes a weight argument. I tested it on the small dataframe you provided, but haven't gone into too much detail. I tried not to use .apply in case you have a large dataframe, though I didn't run a benchmark to see whether my approach is faster or less memory intensive than writing a function that does a weighted describe for a single by group and then using apply to run it on each by group in the dataframe. That would probably be the easiest approach; a rough sketch of it is below.
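Something like the following (untested, and weighted_describe is just an illustrative name) is what I mean by the apply variant:

import pandas as pd

# untested sketch of the apply-based variant: compute weighted stats for a
# single group, then let groupby.apply run it once per group
def weighted_describe(g, col, wt):
    w = g[wt] / g[wt].sum()                        # normalize weights within the group
    return pd.Series({'count': g[col].count(),
                      'mean': (g[col] * w).sum(),  # weighted mean
                      'min': g[col].min(),
                      'max': g[col].max()})

# usage: df.groupby('year').apply(weighted_describe, col='age', wt='wt')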
Counts, min and max can be taken without regard to the weighting. Then I compute a simple weighted mean and a standard deviation based on the formula for the unbiased variance. I included an option for frequency weighting, which only affects the sample size used to adjust the variance to the unbiased estimator: with frequency weights the sample size is the sum of the weights, otherwise it is the row count in the data. I used this answer to help get the weighted percentiles.
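For one group, the mean and standard deviation amount to the following (a minimal numeric sketch using the 2016 rows from the example data below):

import numpy as np

x = np.array([41.0, 65.0])             # values in one group (year 2016)
w = np.array([1.2, 0.7])               # their weights
wn = w / w.sum()                       # weights normalized to sum to 1

mean = (wn * x).sum()                  # weighted mean -> 49.842...
var_biased = (wn * (x - mean)**2).sum()
n = len(x)                             # or w.sum() when treating them as frequency weights
std = (var_biased * n / (n - 1))**0.5  # unbiased correction -> 16.372...

Setting up the example data: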
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
df
year age wt
0 2016 41 1.2
1 2016 65 0.7
2 2020 35 0.8
3 2020 28 1.5
Then I define the function below.
def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be a single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use the sum of the weights as the sample size (only
        affects the degrees-of-freedom correction for the unbiased variance)
    '''
    if isinstance(by, list):
        df = df.sort_values(by + [col])
    else:
        df = df.sort_values([by] + [col])

    # temporary columns; make sure they don't clobber existing ones
    newcols = ['gb_weights', 'col_weighted', 'col_mean',
               'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    assert all([c not in df.columns for c in newcols])

    # weights normalized to sum to 1 within each group
    df['gb_weights'] = df[wt] / df.groupby(by)[wt].transform('sum')
    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()

    # weighted mean and (biased) weighted std. deviation per group
    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
    wstd.name = 'std'
    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'

    # weighted percentiles: offset each group's quantile positions by its
    # group number so a single np.interp call covers all groups
    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max() + 1
    x = np.hstack([quantiles + i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))

    # count, min and max don't depend on the weights
    other = df.groupby(by)[col].agg(['min', 'max', 'count'])

    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]

    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]

    # "sample size" used for the unbiased-variance correction
    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']
    stats['weight'] = sizes

    # use the "sample size" (weight) to obtain the std. deviation from the
    # unbiased variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')

    return stats
And test it out.
weighted_groupby_describe(df, 'age', 'year', 'wt')
count mean std min ... 50% 75% max weight
year ...
2016 2 49.842105 16.372398 41 ... 49.842105 61.842105 65 2
2020 2 30.434783 4.714936 28 ... 30.434783 33.934783 35 2
Compare this to the output without the weights.
df.groupby('year')['age'].describe()
count mean std min 25% 50% 75% max
year
2016 2.0 53.0 16.970563 41.0 47.00 53.0 59.00 65.0
2020 2.0 31.5 4.949747 28.0 29.75 31.5 33.25 35.0
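If your weights are actually frequency weights, pass frequency=True. The only difference is that the weight column, and therefore the weight/(weight-1) correction applied to the variance, uses the sum of the weights per group (1.9 for 2016 and 2.3 for 2020 in this example) instead of the row counts of 2:

weighted_groupby_describe(df, 'age', 'year', 'wt', frequency=True)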