0

In pandas, how to achieve, grouped by date, in each date group, each row of column A and B multiplied and then summed, and then divided by the sum of all B columns in the date group.

I have tried:

(df.groupby('date')['A','B']
    .transform(lambda x: (x['A'] * x['B']).sum())
    .div(df.groupby('date')['B'].agg('sum')))

and:

(df.groupby('date')
    .transform(lambda x: (x['A'] * x['B']).sum())
    .div(df.groupby('date')['B'].agg('sum')))

both showed:

KeyError: 'A'

wjandrea
  • 28,235
  • 9
  • 60
  • 81
xyww
  • 17
  • 3

1 Answers1

1

You should use .apply() instead of .transform(), as follows:

df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())

By using .apply(), the x variables in the lambda function will be recognised as DataFrame objects representing each group, which can each be indexed for individual columns.

The .transform() method only treats x as a Series object representing only one column at a time.

From this answer:

Two major differences between `apply` and `transform`

There are two major differences between the transform and apply groupby methods.

  • Input:
    • apply implicitly passes all the columns for each group as a DataFrame to the custom function.
    • while transform passes each column for each group individually as a Series to the custom function.
  • Output:
    • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
    • The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Example

Sample data:

import pandas as pd

data = {
    'date': ['01/01/1999', '02/01/1999', '03/01/1999', '03/01/1999'],
    'A': [2, 4, 7, 4],
    'B': [5, 7, 9, 6]
}

df = pd.DataFrame(data)
print(df)
         date  A  B
0  01/01/1999  2  5
1  02/01/1999  4  7
2  03/01/1999  7  9
3  03/01/1999  4  6

The calculation for '03/01/1999' would be:

((7 * 9) + (4 * 6)) / (9 + 6) # = 5.8

Calculation for each date group using .apply():

df_ab_calc = df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
print(df_ab_calc)
date
01/01/1999    2.0
02/01/1999    4.0
03/01/1999    5.8
user21283023
  • 936
  • 8