
Suppose I have the following data frame:

group  num  value
a      3    20
a      5    5
b      5    10
b      10   5
b      2    25

Now, I want to compute the weighted average of value, using num as the weight, grouped by column group. Using the tidyverse packages in R, this is straightforward:

> library(tidyverse)
> df <- tribble(
    ~group , ~num , ~value,
    "a"    , 3    , 20,
    "a"    , 5    , 5,
    "b"    , 5    , 10,
    "b"    , 10   , 5,
    "b"    , 2    , 25
)
> df %>%
    group_by(group) %>%
    summarise(new_value = sum(num * value) / sum(num))

# A tibble: 2 x 2
  group new_value
  <chr>    <dbl>
1 a        10.6
2 b         8.82
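
For example, for group a the weighted average is (3 * 20 + 5 * 5) / (3 + 5) = 85 / 8 = 10.625, which the tibble prints rounded as 10.6.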

Using pandas in Python, I can do the intermediate computation beforehand, then use sum() to add up the variables, and finally perform the division using transform(), like this:

import pandas as pd
from io import StringIO

data = StringIO(
"""
group,num,value
a,3,20
a,5,5
b,5,10
b,10,5
b,2,25
""")

df = pd.read_csv(data)

df["tmp_value"] = df["num"] * df["value"]
df = df.groupby(["group"]) \
    [["num", "tmp_value"]] \
    .sum() \
    .transform(lambda x: x["tmp_value"] / x["num"], axis="columns")

print(df)
# group
# a    10.625000
# b     8.823529
# dtype: float64

Note that we first need to explicitly subset the columns of interest ([["num", "tmp_value"]]), compute the sums (sum()), and only then do the division using transform(). In R, we write this in a single step that is much more compact and readable, IMHO.
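
For reference, here is a sketch of one variation that inlines the temporary column with assign() and does the division with pipe() (re-creating the original df so the snippet stands alone); it avoids mutating df, but it is still not as compact as the R version:

import pandas as pd

df = pd.DataFrame(
    {"group": ["a", "a", "b", "b", "b"],
     "num": [3, 5, 5, 10, 2],
     "value": [20, 5, 10, 5, 25]})

result = (
    df.assign(tmp_value=df["num"] * df["value"])    # intermediate product, inline
      .groupby("group")[["num", "tmp_value"]]
      .sum()
      .pipe(lambda g: g["tmp_value"] / g["num"])    # weighted average per group
)

print(result)
# group
# a    10.625000
# b     8.823529
# dtype: float64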

Now, how can I achieve that elegance using pandas? In other words, can it be as clean, elegant, and, above all, as easy to read as it is in R?

an_drade

2 Answers


@an_drade - There is a very similar Stack Overflow question that provides the solution:

Pandas DataFrame aggregate function using multiple columns

Based on the post above, the solution to your question is to create a Python function:

df = pd.DataFrame(
    [['a', 3, 20], ['a', 5, 5], ['b', 5, 10], ['b', 10, 5], ['b', 2, 25]],
    columns=['group', 'num', 'value'])

def wavg(group):
    # value weighted by num: sum(num * value) / sum(num)
    d = group['num']
    w = group['value']
    return (d * w).sum() / d.sum()

final = df.groupby("group").apply(wavg)
print(final)

group
a    10.625000
b     8.823529
dtype: float64
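
If you want a two-column data frame like the R output instead of a Series, one small follow-up step is to name the values column when resetting the index (new_value here is just an example name):

final = final.reset_index(name="new_value")
print(final)
#   group  new_value
# 0     a  10.625000
# 1     b   8.823529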
Newbie123
  • Thanks for pointing it out, @Newbie123. It looks simple and elegant enough, but not quite R yet. Well, it works for me. – an_drade Jul 31 '20 at 03:52

This is the "R way" you wanted:

>>> from datar import f
>>> from datar.tibble import tribble
>>> from datar.dplyr import group_by, summarise
>>> from datar.base import sum
>>> # or if you are lazy:
>>> # from datar.all import *
>>> 
>>> df = tribble(
...     f.group , f.num , f.value,
...     "a"     , 3     , 20,
...     "a"     , 5     , 5,
...     "b"     , 5     , 10,
...     "b"     , 10    , 5,
...     "b"     , 2     , 25
... )
>>> df >> \
...     group_by(f.group) >> \
...     summarise(new_value = sum(f.num * f.value) / sum(f.num))
     group  new_value
  <object>  <float64>
0        a  10.625000
1        b   8.823529

I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

Panwen Wang