
Is there a way to write an aggregation function, as used in the DataFrame.agg method, that has access to more than one column of the data being aggregated? Typical use cases would be weighted average or weighted standard deviation functions.

I would like to be able to write something like

def wAvg(c, w):
    return ((c * w).sum() / w.sum())

df = DataFrame(....) # df has columns c and w, i want weighted average
                     # of c using w as weight.
df.aggregate({"c": wAvg})  # and somehow tell it to use w column as weights ...
user1444817
    Nice article addressing this specific SO question: https://pbpython.com/weighted-average.html – ptim Sep 28 '20 at 08:41

9 Answers


Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)
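A runnable version of the same idea, with a small made-up frame (the `keys`, `data`, and `weights` names follow the snippet above):

```python
import pandas as pd

df = pd.DataFrame({'keys': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 2.0, 3.0, 4.0],
                   'weights': [3.0, 1.0, 1.0, 1.0]})

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

result = df.groupby('keys').apply(wavg)
# group 'a': (1*3 + 2*1) / (3 + 1) = 1.25
# group 'b': (3*1 + 4*1) / (1 + 1) = 3.5
```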
Wes McKinney
    It may be more efficient to break this up into a few operations as follows: (1) create a column of weights, (2) normalize the observations by their weights, (3) compute grouped sum of weighted observations and a grouped sum of weights, (4) normalize weighted sum of observations by the sum of weights. – kalu May 10 '14 at 15:28
    What if we want to calculate wavg's of many variables (columns), e.g. everything except for df['weights'] ? – CPBL Oct 19 '14 at 18:00
    @Wes, is there any way once could do this with `agg()` and a `lambda` built around `np.average(...weights=...)`, or any new native support in pandas for weighted means since this post first appeared? – sparc_spread Apr 24 '15 at 20:03
    @Wes McKinney: In your book you suggest this approach: `get_wavg = lambda g: np.average(g['data'], weights = g['weights'])`; `grouped.apply(get_wavg)` Are the two interchangeable? – robroc Apr 16 '16 at 00:03
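kalu's step-by-step suggestion above can be sketched roughly like this (the frame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'keys': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 2.0, 3.0, 4.0],
                   'weights': [0.25, 0.75, 0.5, 0.5]})

# (1)-(2): weight each observation by its weight
df['wdata'] = df['data'] * df['weights']

# (3): grouped sums of the weighted observations and of the weights
sums = df.groupby('keys')[['wdata', 'weights']].sum()

# (4): normalize the weighted sums by the summed weights
wavg = sums['wdata'] / sums['weights']
# 'a': (0.25*1 + 0.75*2) / 1.0 = 1.75; 'b': (0.5*3 + 0.5*4) / 1.0 = 3.5
```

Because everything stays vectorized, this avoids the per-group Python call that `apply` incurs.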

It is possible to return any number of aggregated values from a groupby object with apply. Simply return a Series, and the index values will become the new column names.

Let's see a quick example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5, 10, 100, 30],
                   'd2': [7, 1, 3, 20],
                   'weights': [.2, .8, .4, .6]},
                  columns=['group', 'd1', 'd2', 'weights'])
df
df

  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6

Define a custom function that will be passed to apply. It implicitly accepts a DataFrame - meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:

def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

Call the groupby apply method with our custom function:

df.groupby('group').apply(weighted_average)

       d1_wa  d2_wa
group              
a        9.0    2.2
b       58.0   13.2

You can get better performance by precalculating the weighted totals into new DataFrame columns as explained in other answers and avoid using apply altogether.
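That vectorized alternative might look like the following for the example above, reproducing the same `d1_wa`/`d2_wa` output without `apply`:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                   'd1': [5, 10, 100, 30],
                   'd2': [7, 1, 3, 20],
                   'weights': [.2, .8, .4, .6]})

# precalculate the weighted values as regular columns
weighted = df[['d1', 'd2']].multiply(df['weights'], axis=0)
weighted['group'] = df['group']

# grouped sum of weighted values, normalized by the grouped sum of weights
result = (weighted.groupby('group').sum()
                  .div(df.groupby('group')['weights'].sum(), axis=0)
                  .add_suffix('_wa'))
# same numbers as the apply version: a -> 9.0 / 2.2, b -> 58.0 / 13.2
```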

Ted Petrou

My solution is similar to Nathaniel's, but it's for a single column and it doesn't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the groupby(...).apply(...) solution is about 100x(!)

def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
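A self-contained usage sketch, copying the function verbatim so it runs on its own (the frame and column names are made up):

```python
import pandas as pd

def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result

df = pd.DataFrame({'group': ['a', 'a', 'b'],
                   'c': [1.0, 3.0, 5.0],
                   'w': [1.0, 3.0, 2.0]})
res = weighted_average(df, 'c', 'w', 'group')
# group 'a': (1*1 + 3*3) / (1 + 3) = 2.5; group 'b': 5.0
# the temporary helper columns are cleaned up again by the del line
```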
ErnestScribbler
  • Would be more readable if you used PEP8 consistently and remove the superfluous `del` line. – MERose Jul 27 '19 at 16:29
  • Thanks! The `del` line is actually not superfluous, since I change the input DataFrame in-place to improve performance, so I have to clean up. – ErnestScribbler Aug 11 '19 at 12:57
  • But you return the result in the very next line which ends the function. Once the function is finished, all internal objects are purged anyways. – MERose Aug 11 '19 at 19:52
    But notice that df is not an internal object. It's an argument to the function, and as long as you never assign to it (`df = something`) it remains a shallow copy and is changed in-place. In this case, columns would be added to the DataFrame. Try copy-pasting this function and running it without the `del` line, and see that it changes the given DataFrame by adding columns. – ErnestScribbler Aug 13 '19 at 14:59
  • This does not answer the question, because weighted average just serves as an example to any aggregate on multiple columns. – user__42 Jul 03 '20 at 12:51

Here's a solution which has the following benefits:

  1. You don't need to define a function in advance
  2. You can use it within a pipe (since it's using lambda)
  3. You can name the resulting column

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'weighted_average': np.average(x.data, weights=x.weights)}))

You can also use the same code to perform multiple aggregations:

df.groupby('group') \
  .apply(lambda x: pd.Series({
      'weighted_average': np.average(x.data, weights=x.weights),
      'regular_average': np.average(x.data)}))
Iyar Lin

I do this a lot and found the following quite handy:

def weighed_average(grp):
    return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()

df.groupby('SOME_COL').apply(weighed_average)

This will compute the weighted average of all the numerical columns in the df and drop non-numeric ones.
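For illustration, a self-contained sketch of the same idea; note that `_get_numeric_data` is a private pandas method, so the public `select_dtypes` stands in for it here, and the frame and column names are made up:

```python
import pandas as pd

def weighed_average(grp):
    # select the numeric columns, weight them by COUNT, and normalize
    num = grp.select_dtypes('number')
    return num.multiply(grp['COUNT'], axis=0).sum() / grp['COUNT'].sum()

df = pd.DataFrame({'SOME_COL': ['x', 'x', 'y'],
                   'val': [10.0, 20.0, 30.0],
                   'COUNT': [1.0, 3.0, 2.0]})
out = df.groupby('SOME_COL').apply(weighed_average)
# 'val' for group 'x': (10*1 + 20*3) / (1 + 3) = 17.5
```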

santon
  • This is blazing fast! Great job! – Shay Ben-Sasson Nov 17 '16 at 17:49
  • This is really sweet if you have multiple columns. Nice! – Chris Dec 20 '16 at 02:48
  • @santon, thanks for the answer. Could you give an example of your solution? I got an error 'KeyError: 'COUNT' while trying to use your solution. – Allen Qin Feb 12 '17 at 18:16
  • @Allen You should use whatever the name of the column is that has the counts you want to use for the weighted average. – santon Feb 13 '17 at 01:22
  • this is cool and very fast. I wish I could pass the counts column into the function too. Then it can be used more generically. Tried it, but got *No axis named nt_pop for object type DataFrame*, where nt_pop was the column that has the counts. – SModi Mar 22 '21 at 21:05
    @SModi You can easily modify the function to `def weighted_average(grp, col='COUNT'): ...` to parametrize the column name. (Of course, replace `'COUNT'` with `col` in the code as well.) Then, simply call `apply(lambda g: weighted_average(g, "nt_pop"))`. – santon Mar 24 '21 at 03:35
    Thanks this worked. I'd edited the function correctly when I tried earlier, but I wasn't calling it correctly. – SModi Mar 25 '21 at 19:41

The following (based on Wes McKinney's answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg


def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol:sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol:min})
    return df_ret

The df_wavg() function returns a dataframe grouped by the "groupby" column, with the sums of the weights in the weights column. The other columns are the weighted averages or, if non-numeric, aggregated with the min() function.
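Copying both functions verbatim, a runnable illustration with a made-up frame (the `g`, `x`, `label`, and `w` names are illustrative; `label` is non-numeric and falls back to min()):

```python
import pandas as pd

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg

def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol: sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # non-numeric columns are aggregated with min
            df_ret[dcol] = grouped.agg({dcol: min})
    return df_ret

df = pd.DataFrame({'g': ['a', 'a', 'b'],
                   'x': [1.0, 3.0, 5.0],
                   'label': ['q', 'p', 'r'],
                   'w': [1.0, 3.0, 2.0]})
out = df_wavg(df, 'g', 'w')
# 'w' holds the summed weights; 'x' the weighted averages; 'label' the group minima
```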

dslack

Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
    """
    :param values: column(s) to take the average of
    :param weights: column to weight on
    :param groupby_args: args to pass into groupby (e.g. the level you want to group on)
    :param groupby_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """

    if isinstance(values, str):
        values = [values]

    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)

        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df

pd.DataFrame.grouped_weighted_average = grouped_weighted_average
Nathaniel
    When you say non-performant. How much is the difference? Have measured it? – Bouncner Mar 22 '17 at 11:32
  • Talking about "non-performant": copying "self" each time to "df" and applying "sum" to whole df, does it look "performant", or even "sound" (some columns may contain values which are not meant to be summed)? – serge.v May 29 '21 at 09:46

You can implement this function in the following way:

(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

For example:

df = pd.DataFrame({'groups': [1, 1, 2, 2], 'c': [3, 3, 4, 4], 'w': [5, 5, 6, 6]})
(df['c'] * df['w']).groupby(df['groups']).sum() / df.groupby('groups')['w'].sum()

Result:

groups
1    3.0
2    4.0
dtype: float64
Mykola Zotko

Adding to Wes McKinney's answer, this will rename the aggregated column:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg).reset_index().rename(columns={0 : "wavg"})
s510