Suppose I have the following data frame:
group  num  value
a        3     20
a        5      5
b        5     10
b       10      5
b        2     25
Now, I want to compute the weighted average of column value, using column num as the weights, grouping by column group. Using tidyverse packages in R, this is straightforward:
> library(tidyverse)
> df <- tribble(
    ~group, ~num, ~value,
    "a",        3,     20,
    "a",        5,      5,
    "b",        5,     10,
    "b",       10,      5,
    "b",        2,     25
  )
> df %>%
    group_by(group) %>%
    summarise(new_value = sum(num * value) / sum(num))
# A tibble: 2 x 2
  group new_value
  <chr>     <dbl>
1 a         10.6
2 b          8.82
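Just to make the target concrete, here is the same formula, sum(num * value) / sum(num), worked out by hand in plain Python for each group:

print((3 * 20 + 5 * 5) / (3 + 5))                 # group a -> 10.625
print((5 * 10 + 10 * 5 + 2 * 25) / (5 + 10 + 2))  # group b -> 8.8235...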
Using Pandas in Python, I can compute the intermediate product beforehand, then use sum() to add up both columns within each group, and finally perform the division with transform(), like this:
import pandas as pd
from io import StringIO
data = StringIO(
"""
group,num,value
a,3,20
a,5,5
b,5,10
b,10,5
b,2,25
""")
df = pd.read_csv(data)
df["tmp_value"] = df["num"] * df["value"]
# sum num and num*value within each group, then divide row-wise
df = (
    df.groupby(["group"])[["num", "tmp_value"]]
    .sum()
    .transform(lambda x: x["tmp_value"] / x["num"], axis="columns")
)
print(df)
# group
# a    10.625000
# b     8.823529
# dtype: float64
Note that we first have to explicitly subset the columns of interest ([["num", "tmp_value"]]), then compute the sums (sum()), and only then do the division with transform(). In R, this is written in a single step that is much more compact and readable, IMHO.
Now, how can I achieve that elegance with Pandas? In other words, can the code be cleaner, more elegant, and above all as easy to read as it is in R?
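For reference, the closest one-step version I could come up with is a plain groupby().apply() with numpy.average. It is just a sketch, and it assumes df is the original data frame read from the CSV above (not the reassigned result), so I am not sure it counts as the idiomatic answer:

import numpy as np

# weighted mean of value, weighted by num, computed per group in one call
result = df.groupby("group").apply(
    lambda g: np.average(g["value"], weights=g["num"])
)
print(result)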