
I'm trying to calculate a weighted average for multiple columns in a dataframe. This is a sample of my data

Group Year Month Weight(kg) Nitrogen Calcium
A 2020 01 10000 10 70
A 2020 01 15000 4 78
A 2021 05 12000 5 66
A 2021 05 10000 8 54
B 2021 08 14000 10 90
C 2021 08 50000 20 92
C 2021 08 40000 10 95

My desired result would look something like this (one row per Group/Year/Month, with the weighted average of every nutrient column):

Group Year Month Nitrogen Calcium
A 2020 01 6.40 74.80
A 2021 05 6.36 60.55
B 2021 08 10.00 90.00
C 2021 08 15.56 93.33

What I've tried: I can get the correct weighted average for a single column using this function (similar to: link):

def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
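For reference, a minimal runnable check of this helper on the first group of the sample data shown above (the DataFrame below is just those two rows typed out by hand):

```python
import pandas as pd

# the wavg helper from the question, unchanged
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()

# first group (A / 2020 / 01) of the sample data
df = pd.DataFrame({
    "Group": ["A", "A"],
    "Year": [2020, 2020],
    "Month": [1, 1],
    "Weight(kg)": [10000, 15000],
    "Nitrogen": [10, 4],
    "Calcium": [70, 78],
})

# (70*10000 + 78*15000) / 25000 = 74.8
print(wavg(df, "Calcium", "Weight(kg)"))  # 74.8
```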

I can apply this function to a single column of my df:

df2 = df.groupby(["Group", "Year", "Month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()

(Don't mind the different values, they are correct for the data in my notebook)


The obvious problem is that this function only works for a single column, whilst I have dozens of columns. I therefore tried a for loop:

column_list=[]
for column in df.columns:
  column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").to_frame())

It calculates the values correctly, but the columns are stacked on top of each other instead of placed next to each other, and they also lack a useful column name:


How could I adapt my code to return the desired df?

brenda89
  • try concatenating them (on either axis according to your need): `df=pd.concat(column_list)` or `df=pd.concat(column_list,axis=1)` – Anurag Dabas Jul 12 '21 at 13:06
  • Thanks, the pd.concat (column_list, axis=1) actually works. I do get a multiindex that needs some flattening but this is a step in the right direction. – brenda89 Jul 12 '21 at 13:41
  • hi! Is any one of the answers below working? If so & if you wish, you might consider [accepting](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235#5235) one of them to signal others that the issue is resolved. If not, you can provide feedback so they can be improved (or removed altogether) – Anurag Dabas Aug 14 '21 at 06:56

2 Answers


Change the function to work with multiple columns at once; to avoid losing the grouping columns, convert them to a MultiIndex first:

def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        # weight every column, sum within the group, divide by total weight
        return d.mul(w, axis=0).sum().div(w.sum())
    except ZeroDivisionError:
        return d.mean()

#columns used for groupby
groups = ["Group", "Year", "Month"]
#all other columns to be averaged
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)

#set the group index and apply wavg to every column in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
  Group  Year  Month   Nitrogen    Calcium
0     A  2020      1   6.400000  74.800000
1     A  2021      5   6.363636  60.545455
2     B  2021      8  10.000000  90.000000
3     C  2021      8  15.555556  93.333333
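A self-contained version of this approach, run on the question's sample data (the `.sum()` inside the function collapses each group to a single row of weighted averages):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "Group": list("AAAABCC"),
    "Year": [2020, 2020, 2021, 2021, 2021, 2021, 2021],
    "Month": [1, 1, 5, 5, 8, 8, 8],
    "Weight(kg)": [10000, 15000, 12000, 10000, 14000, 50000, 40000],
    "Nitrogen": [10, 4, 5, 8, 10, 20, 10],
    "Calcium": [70, 78, 66, 54, 90, 92, 95],
})

def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    # weight every column, sum within the group, divide by the total weight
    return d.mul(w, axis=0).sum().div(w.sum())

groups = ["Group", "Year", "Month"]
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)

df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print(df1)
```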
jezrael
  • Thanks for your feedback. The code almost works, but during the creation of df2 I get the following error: ValueError: cannot join with no overlapping index names. How could I work around this? – brenda89 Jul 12 '21 at 13:38
  • @brenda89 - Does it work if you remove `.reset_index()`? – jezrael Jul 12 '21 at 13:40
  • @brenda89 - What is the variable `cols`? Does it work with all columns not used for the groupby, like `cols = df.columns.difference(["Group", "Year", "Month"], sort=False)`? – jezrael Jul 12 '21 at 13:44

Try via concat() and reset_index():

df=pd.concat(column_list,axis=1).reset_index()

OR

you can make changes here:

column_list=[]
for column in df.columns:
  column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").reset_index())

#Finally:

df=pd.concat(column_list,axis=1)
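As the question's follow-up comment notes, the concatenated result still carries the group keys as a row MultiIndex, and each single-column frame from `to_frame()` is named `0`. A minimal sketch of cleaning that up, using hypothetical values standing in for the loop's output:

```python
import pandas as pd

# hypothetical stand-in for column_list: one single-column frame per
# nutrient, each indexed by the (Group, Year, Month) keys from the groupby
idx = pd.MultiIndex.from_tuples([("A", 2020, 1), ("B", 2021, 8)],
                                names=["Group", "Year", "Month"])
column_list = [
    pd.DataFrame({0: [6.4, 10.0]}, index=idx),    # Nitrogen wavg
    pd.DataFrame({0: [74.8, 90.0]}, index=idx),   # Calcium wavg
]
value_cols = ["Nitrogen", "Calcium"]

out = pd.concat(column_list, axis=1)  # aligns on the shared MultiIndex
out.columns = value_cols              # give each column its real name
out = out.reset_index()               # flatten the group keys into columns
print(out.columns.tolist())  # ['Group', 'Year', 'Month', 'Nitrogen', 'Calcium']
```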
Anurag Dabas