1

I'm new to Pandas and want to convert the following simple R code to Pandas for computing both the average and weighted average of a column (in practice, there are many more columns to be aggregated). The solution has to be chainable, as there are multiple steps both before and after this calculation. I have looked at solutions using the apply function (Calculate weighted average using a pandas/dataframe), but then it seems that one either has to do the full aggregation step (on all, perhaps unrelated, columns) inside inside the apply function, which I find ugly, or compute the average and weighted average separately and then afterwards do a table join. What is the state of the art way to do this in Pandas?

df = data.frame(batch=c("A", "A", "B", "B", "C","C"), value=1:6, weight=1:6)
df %>% 
  group_by(batch) %>% 
  summarise(avg = mean(value), avg_weighted = sum(value*weight)/sum(weight))

# A tibble: 3 x 3
  batch   avg avg_weighted
  <chr> <dbl>        <dbl>
1 A       1.5         1.67
2 B       3.5         3.57
3 C       5.5         5.55

And here my Pandas attempt:

df2 = pd.DataFrame({'batch': ["A", "A", "B", "B", "C", "C"], 'value':[1,2,3,4,5,6], 'weight':[1,2,3,4,5,6]})

def agg_step(grp):
    return pd.DataFrame({'avg':[grp['value'].mean()], 
          'avg_weighted':np.average(grp['value'], weights=grp['weight'])})

(df2.
    groupby('batch')
    .apply(agg_step)
    .reset_index()
    .drop(columns='level_1')
)

Out[93]: 
  batch  avg  avg_weighted
0     A  1.5      1.666667
1     B  3.5      3.571429
2     C  5.5      5.545455
Mattias
  • 11
  • 2
  • Can you post the output as well? Also if you can post the python version of either of the two options you mentioned it would go a long way to showing you've tried to solve it yourself. – Dan Oct 21 '19 at 09:33
  • Post updated according to suggestions. – Mattias Oct 21 '19 at 12:43
  • Possible duplicate of [groupby weighted average and sum in pandas dataframe](https://stackoverflow.com/questions/31521027/groupby-weighted-average-and-sum-in-pandas-dataframe) – Dan Oct 21 '19 at 13:14

1 Answers1

0

This should work:

(df2.groupby("batch")
    .agg({
        "value": [
            "mean", 
            lambda x: np.average(x, weights=df2.loc[x.index, "weight"])
         ]
     }))

based on https://stackoverflow.com/a/31521177/1011724

Dan
  • 45,079
  • 17
  • 88
  • 157
  • This is not a chained solution, because it references df2 inside the aggregation. If you want to compute the weighed average of another dataframe, you would need to update the code. – Mattias Oct 21 '19 at 13:42
  • wrap it in a function then? – Dan Oct 21 '19 at 13:46