
I have a dataframe:

    Out[78]: 
   contract month year  buys  adjusted_lots    price
0         W     Z    5  Sell             -5   554.85
1         C     Z    5  Sell             -3   424.50
2         C     Z    5  Sell             -2   424.00
3         C     Z    5  Sell             -2   423.75
4         C     Z    5  Sell             -3   423.50
5         C     Z    5  Sell             -2   425.50
6         C     Z    5  Sell             -3   425.25
7         C     Z    5  Sell             -2   426.00
8         C     Z    5  Sell             -2   426.75
9        CC     U    5   Buy              5  3328.00
10       SB     V    5   Buy              5    11.65
11       SB     V    5   Buy              5    11.64
12       SB     V    5   Buy              2    11.60

I need the sum of adjusted_lots, and the weighted average of price (weighted by adjusted_lots), grouped by all the other columns, i.e. grouped by (contract, month, year and buys).

A similar solution in R was achieved with the following dplyr code, but I have been unable to do the same in pandas.

> newdf = df %>%
  select(contract, month, year, buys, adjusted_lots, price) %>%
  group_by(contract, month, year, buys) %>%
  summarise(qty = sum(adjusted_lots), avgpx = weighted.mean(x = price, w = adjusted_lots), comdty = "Comdty")

> newdf
Source: local data frame [4 x 6]

  contract month year comdty qty     avgpx
1        C     Z    5 Comdty -19  424.8289
2       CC     U    5 Comdty   5 3328.0000
3       SB     V    5 Comdty  12   11.6375
4        W     Z    5 Comdty  -5  554.8500

Is the same possible with groupby, or is there another solution?

samsri

8 Answers


EDIT: updated the aggregation so it works with recent versions of pandas

To apply multiple functions to a groupby object, pass one tuple per output column, pairing the source column with the aggregation function to apply:

import numpy as np

# Define a lambda function to compute the weighted mean
# (it looks up the weights in the same df that is grouped below):
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])

# Passing a dictionary of functions per column with renaming
# is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean': wm}}
# df.groupby(["contract", "month", "year", "buys"]).agg(f)

# Group by and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
                                                      price_weighted_mean=("price", wm))

                          adjusted_lots  price_weighted_mean
contract month year buys                                    
C        Z     5    Sell            -19           424.828947
CC       U     5    Buy               5          3328.000000
SB       V     5    Buy              12            11.637500
W        Z     5    Sell             -5           554.850000

[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling

jrjc
  • I tried this and I got what I wanted, except that the first few columns became multi-indices. Could you please let us know how we can fix that? I want it to be a data frame with all the index as columns; as in your example, ```(contract, month, year, buys)``` became indices. – Saurabh Bade May 09 '20 at 05:26
  • have you tried `df.reset_index()`? Otherwise I suggest you open a new question with what you have, what you tried and what you'd like. – jrjc May 11 '20 at 16:04
  • can i clarify what x.index represents in `wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])`? @jrjc – Lko Aug 31 '20 at 14:48
  • Great solution! It saved me here. Is there a way to do it in an expanding().mean() rationale? – Daniel Arges Mar 23 '21 at 18:50
  • Hey guys, I would like to recommend the answers below since they are not only easy to remember but also vectorized, so much faster. – W. Ding Jun 29 '21 at 07:32
  • This is great, thanks! Is there a way to prevent ZeroDivisionError besides filtering out adjusted_lots = 0? – Niv Cohen Aug 17 '21 at 19:57

Computing a weighted average via groupby(...).apply(...) can be very slow (about 100x slower than the following approach). See my answer (and others) in this thread.

import pandas as pd

def weighted_average(df, data_col, weight_col, by_col):
    # Weighted mean computed as sum(data * weight) / sum(weight),
    # counting only the weights whose data values are not null.
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
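
For example, on the question's DataFrame (a usage sketch; it also adds the qty sum that a comment below notes is missing):

    import pandas as pd

    keys = ['contract', 'month', 'year', 'buys']
    avgpx = weighted_average(df, 'price', 'adjusted_lots', keys)
    qty = df.groupby(keys)['adjusted_lots'].sum()
    result = pd.concat([qty.rename('qty'), avgpx.rename('avgpx')], axis=1)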
ErnestScribbler
  • your solution doesn't include the 'adjusted_lots' sums from the original problem that need to be there. – antonio_zeus Jan 13 '18 at 21:17
  • in case someone also needs help understanding this: a weighted average is normally val_0 * weight_0 + val_1 * weight_1 + ... + val_n * weight_n, where all the weights sum up to 1.0. When sum(weight) != 1, you need to normalize by dividing by the total weight. In this method, we calculate each val_i * non_normalized_weight_i (_data_times_weight) and the separate non_normalized_weight_i (where the data is not null, _weight_where_notnull). We then group, sum each group's val_i * non_normalized_weight_i, and divide by the group's total non-normalized weight in order to normalize it. – ajoseps Apr 19 '22 at 17:06

Wouldn't it be a lot simpler to do this, as sketched below?

  1. Multiply adjusted_lots by price into a new column "X"
  2. Use groupby().sum() on columns "X" and "adjusted_lots" to get a grouped DataFrame df_grouped
  3. Compute the weighted average on df_grouped as df_grouped['X'] / df_grouped['adjusted_lots']
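
A minimal sketch of these steps, assuming the question's df and grouping keys:

    # Step 1: weight each price by its lots
    df['X'] = df['adjusted_lots'] * df['price']
    # Step 2: group and sum both columns
    df_grouped = df.groupby(['contract', 'month', 'year', 'buys'])[['X', 'adjusted_lots']].sum()
    # Step 3: divide the grouped sums to get the weighted average per group
    df_grouped['weighted_avg_price'] = df_grouped['X'] / df_grouped['adjusted_lots']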

The solution that uses a dict of aggregation functions to rename columns is deprecated in pandas (as of version 0.22) and will be removed in a future version:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

Use a groupby apply and return a Series to rename columns as discussed in: Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")

def my_agg(x):
    # Return a Series so the aggregated column gets the desired name
    names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum() / x['adjusted_lots'].sum()}
    return pd.Series(names, index=['weighted_ave_price'])

produces the same result:

>>> df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)

                          weighted_ave_price
contract month year buys                    
C        Z     5    Sell          424.828947
CC       U     5    Buy          3328.000000
SB       V     5    Buy            11.637500
W        Z     5    Sell          554.850000
Mark Greenwood
  • I have zero values in my dataframe for the weighted column, in this case adjusted_lots. I want to avoid this warning: **RuntimeWarning: invalid value encountered in double_scalars.** What's an efficient way to do that using this setup? A possible method would be: [link](https://stackoverflow.com/questions/49864679/pandas-numpy-weighted-average-zerodivisionerror), but I do not know how to combine this with your solution. Insight would be appreciated. – Inder Jalli Feb 10 '20 at 14:58
  • The way you phrase your answer might be misread. What will be deprecated is the ability to aggregate, rename and add a multilevel index at the same time. You can still use a dictionary to apply different and multiple aggregation functions per column in the future. – Buggy Mar 18 '20 at 12:55
  • @InderJalli Perhaps [this answer](https://stackoverflow.com/a/69787938/) will help with handling zeros. – Asclepius Nov 01 '21 at 01:48

With datar, you don't have to learn pandas APIs to transition your R code:

>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
...     contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
...     month=c(rep('Z', 9), 'U', rep('V', 3)),
...     year=5,
...     buys=c(rep('Sell', 9), rep('Buy', 4)),
...     adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
...     price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75, 3328.00, 11.65, 11.64, 11.60]
... )
>>> df
   contract month  year  buys  adjusted_lots    price
0         W     Z     5  Sell             -5   554.85
1         C     Z     5  Sell             -3   424.50
2         C     Z     5  Sell             -2   424.00
3         C     Z     5  Sell             -2   423.75
4         C     Z     5  Sell             -3   423.50
5         C     Z     5  Sell             -2   425.50
6         C     Z     5  Sell             -3   425.25
7         C     Z     5  Sell             -2   426.00
8         C     Z     5  Sell             -2   426.75
9        CC     U     5   Buy              5  3328.00
10       SB     V     5   Buy              5    11.65
11       SB     V     5   Buy              5    11.64
12       SB     V     5   Buy              2    11.60
>>> newdf = df >> \
...   select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >> \
...   group_by(f.contract, f.month, f.year, f.buys) >> \
...   summarise(
...       qty = sum(f.adjusted_lots), 
...       avgpx = weighted_mean(x = f.price , w = f.adjusted_lots), 
...       comdty = "Comdty"
...   )
[2021-05-24 13:11:03][datar][   INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (override with `_groups` argument)
>>> 
>>> newdf
  contract month  year  buys  qty        avgpx  comdty
0        C     Z     5  Sell  -19   424.828947  Comdty
1       CC     U     5   Buy    5  3328.000000  Comdty
2       SB     V     5   Buy   12    11.637500  Comdty
3        W     Z     5  Sell   -5   554.850000  Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]

I am the author of the package. Feel free to submit issues if you have any questions.

Panwen Wang

This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.

import numpy as np
import pandas as pd

def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
    # Ref: https://stackoverflow.com/a/69787938/
    def group_weighted_mean(x):
        try:
            return np.average(x, weights=df.loc[x.index, weight_col_name])
        except ZeroDivisionError:
            # Fall back to the unweighted mean when a group's weights sum to zero.
            return np.average(x)
    return group_weighted_mean

df = ...  # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...)  # Define
agg_df = g.agg({'price': group_weighted_mean})

This approach however is needlessly more complex than the answer by Rohit P. In retrospect, I would just use the answer by Rohit P.

Asclepius

ErnestScribbler's answer is much faster than the accepted solution. Here is a multivariate analogue:

import pandas as pd

def weighted_average(df, data_col, weight_col, by_col):
    '''Weighted average where data_col can be a list of columns.'''
    df_data = df[data_col].multiply(df[weight_col], axis='index')
    df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
    df_data[by_col] = df[by_col]
    df_weight[by_col] = df[by_col]
    result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
    return result
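
For example, on the question's df (a usage sketch; data_col is now a list, and the result keeps the source column names):

    result = weighted_average(df, ['price'], 'adjusted_lots', ['contract', 'month', 'year', 'buys'])
    # result['price'] now holds the weighted average per group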

I came across this thread when confronted with a similar problem. In my case, I wanted to generate a weighted quarterback rating metric for NFL games in which more than one quarterback attempted a pass.

I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.

The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas' .agg() method, but I get to do so with some additional information preconfigured from the top-level factory function.

import numpy as np

def weighted_mean_factory(*args, **kwargs):
    # Capture the weights Series once, at factory time
    weights = kwargs.get('w').copy()

    def weighted_mean(x):
        x_mask = ~np.isnan(x)
        w = weights.loc[x.index]

        if not x_mask.any():
            raise ValueError('there are no non-missing x variable values')

        return np.average(x[x_mask], weights=w[x_mask])

    return weighted_mean

# Note: count_is_not_zero is a separate user-defined aggregation helper (not shown here).
res_df = df.groupby(['game_id', 'team'])\
    .agg(pass_player_cnt=('attempts', count_is_not_zero),
         completions=('completions', 'sum'),
         attempts=('attempts', 'sum'),
         pass_yds=('pass_yards', 'sum'),
         pass_tds=('pass_tds', 'sum'),
         pass_int=('pass_int', 'sum'),
         sack_taken=('sacks_taken', 'sum'),
         sack_yds_loss=('sack_yds_loss', 'sum'),
         longest_completion=('longest_completion', 'max'),
         qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts']))
         )

Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:

149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Matt Barstead