
Say I have the following dataframe:

>>> import numpy as np
>>> import pandas as pd
>>> df=pd.DataFrame({'category':['a','a','b','b'],
... 'var1':np.random.randint(0,100,4),
... 'var2':np.random.randint(0,100,4),
... 'weights':np.random.randint(0,10,4)})
>>> df
  category  var1  var2  weights
0        a    37    36        7
1        a    47    20        1
2        b    33     7        6
3        b    16     6        8

I can calculate the weighted average of 'var1' like this:

>>> Grouped=df.groupby('category')
>>> GetWeightAvg=lambda g: np.average(g['var1'], weights=g['weights'])
>>> Grouped.apply(GetWeightAvg)
category
a    38.250000
b    23.285714
dtype: float64

However, I am wondering whether there is a way to write the function so that, when I apply it to the grouped object, I can specify which column (or columns) to calculate it for. Rather than hard-coding 'var1' into the function, I'd like to pass the column name at the point where I apply it.

Just as I can get an unweighted average of both columns like this:

>>> Grouped[['var1','var2']].mean()
          var1  var2
category            
a         42.0  28.0
b         24.5   6.5

I'm wondering if there is a parallel way to do that with weighted averages.

AJG519

2 Answers


You can apply and return both averages:

In [10]: g = df.groupby("category")

In [11]: g.apply(lambda x: pd.Series(np.average(x[["var1", "var2"]], weights=x["weights"], axis=0), ["var1", "var2"]))
Out[11]:
               var1       var2
category
a         38.250000  34.000000
b         23.285714   6.428571

You could write this a little more cleanly as a function:

In [21]: def weighted(x, cols, w="weights"):
             return pd.Series(np.average(x[cols], weights=x[w], axis=0), cols)

In [22]: g.apply(weighted, ["var1", "var2"])
Out[22]:
               var1       var2
category
a         38.250000  34.000000
b         23.285714   6.428571
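
The same function also works for a single column if you pass a one-element list (this call is not in the original answer, but the values match the per-column result computed in the question):

In [23]: g.apply(weighted, ["var1"])
Out[23]:
               var1
category
a         38.250000
b         23.285714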
Andy Hayden
  • That is helpful. However, some of my values in one column (but not the others) are NaN. For those, I would like to compute the average by dropping the NaN values and using the rest. This solution instead returns NaN for those columns. This question https://stackoverflow.com/questions/43049014/python-take-weighted-average-inside-pandas-groupby-while-ignoring-nan does not handle multiple columns; np.nanmean does not accept weights, and np.average has no option to control the treatment of NaNs. – CPBL Dec 01 '19 at 22:25
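
One way around the NaN issue raised in this comment (a sketch added here, not part of either answer; weighted_nan is an illustrative name) is to mask each column separately before calling np.average:

def weighted_nan(x, cols, w="weights"):
    # For each requested column, drop the rows where that column is NaN
    # and average the remaining values against the matching weights.
    # Assumes every group has at least one non-NaN value per column.
    out = {}
    for c in cols:
        mask = x[c].notna()
        out[c] = np.average(x.loc[mask, c], weights=x.loc[mask, w])
    return pd.Series(out)

g.apply(weighted_nan, ["var1", "var2"])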

Following up on Andy's solution, I wanted to use one of the index levels of a MultiIndex as my weights.

np.random.seed(1)
arrays = [list('AAABBB'), [0.01,0.02,0.03,0.07,0.09,0.11]]
tups = list(zip(*arrays))
x = pd.MultiIndex.from_tuples(tups)
df = pd.DataFrame(index=x, data=np.random.randint(10, 100, (6, 6)), columns=list('STUVWX'))
df.index.names = ['bin','prob']

             S   T   U   V   W   X
bin prob                          
A   0.0100  47  22  82  19  85  15
    0.0200  89  74  26  11  86  81
    0.0300  16  35  60  30  28  94
B   0.0700  21  38  39  24  60  78
    0.0900  97  97  96  23  19  17
    0.1100  73  71  32  67  11  10

Adapting the function to use one of the index levels as the weights:

def weighted(x, w="weights"):
    # Pull the weights from the index level named by `w`.
    return pd.Series(np.average(x, weights=x.index.get_level_values(w), axis=0), index=x.columns)

and calling

df.groupby(level=['bin']).apply(weighted, "prob")

which gives:

          S       T       U       V       W       X
bin                                                
A   45.5000 45.8333 52.3333 21.8333 56.8333 76.5000
B   67.5185 71.1111 55.1481 41.1852 26.3704 29.9630
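
As a quick sanity check (added here, not from the original answer), the first cell can be reproduced by hand from the input table:

# Hand check of the 'A' row, column 'S':
(47*0.01 + 89*0.02 + 16*0.03) / (0.01 + 0.02 + 0.03)   # = 2.73 / 0.06 = 45.5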
Dickster