How to calculate the sum of a subset within a Dataframe based on multiple columns

Question

I'm relatively new to using Pandas, I have a dataframe that looks like the below, my goal is to replicate the "desired_output" column:

+---------+--------+-------------+-------+------------+----------------+
| Main ID | Sub ID | Senior Flag | order | Dollar Amt | desired_output |
+---------+--------+-------------+-------+------------+----------------+
|       1 | A      | Y           |     1 |         10 |              8 |
|       1 | B      | N           |     2 |          5 |              3 |
|       1 | C      | N           |     3 |          3 |              0 |
|       2 | A-1    | Y           |     1 |         15 |             13 |
|       2 | A-2    | Y           |     2 |         10 |             13 |
|       2 | B      | N           |     3 |          7 |              6 |
|       2 | C      | N           |     4 |          6 |              0 |
+---------+--------+-------------+-------+------------+----------------+

The logic I would like to create in Python for the desired_output column is below:

1) Group by the Main IDs

2) For all of the Sub IDs check if the Senior Flag is Y, if it is only take the sum of the Sub IDs where the Senior Flag is N.

3) For the Sub IDs where the Senior Flag is N, the sum should be of anything where the order is greater than it's current order (e.g. for Main ID 2, The B Sub ID would take the value for C SubID since it's order is greater than B)

I'm sorry I know typically users would want a sample of the code, but the dataframe is much larger than this so for simplicity sake I'm only using the above table.

I appreciate all the Help!!

ansev · Accepted Answer · 2020-03-15T21:01:19.450

We could use GroupBy.sum to get sum by Main ID only where Senior Flag is N, GroupBy.cumsum to get the cumsum Dollar Amt value also when the condition is fulfilled. Series.map in order to assign the value of the sum to the rows where Senior Flag is Y, (assuming that Senior Flag can only take Y or N)

df2 = df.sort_values(['Main ID','order'],ascending = [True,False])
m = df['Senior Flag'].eq('N')
groups = df2.loc[m].groupby('Main ID')['Dollar Amt']
df['desired output'] = groups.cumsum().sub(df['Dollar Amt'])
df.loc[~m, 'desired output'] = df['Main ID'].map(groups.sum())
print(df)

output

   Main ID Sub ID Senior Flag  order  Dollar Amt  desired output
0        1      A           Y      1          10             8.0
1        1      B           N      2           5             3.0
2        1      C           N      3           3             0.0
3        2    A-1           Y      1          15            13.0
4        2    A-2           Y      2          10            13.0
5        2      B           N      3           7             6.0
6        2      C           N      4           6             0.0

Details

print(df.loc[m])
   Main ID Sub ID Senior Flag  order  Dollar Amt
1        1      B           N      2           5
2        1      C           N      3           3
5        2      B           N      3           7
6        2      C           N      4           6

from this dataframe we obtain the sum:

df.loc[m].groupby('Main ID')['Dollar Amt'].sum()
Main ID
1     8
2    13
Name: Dollar Amt, dtype: int64

and then we map it to match the index with the original series:

df['Main ID'].map(df.loc[m].groupby('Main ID')['Dollar Amt'].sum())
0     8
1     8
2     8
3    13
4    13
5    13
6    13
Name: Main ID, dtype: int64

assigning only where Senior Flag is Y:

df.loc[~m, 'desired output'] = df['Main ID'].map(groups.sum())

Thank you, I'm going to try this out. One question I have is for when the senior flag is N, can I do a sum shift? Essentially take the sum of all the rows where the order is greater. For example, if there was a subID after C called D, I would want B to take the sum of C & D. Besides that, I think this will work for the senior flag = Y — MartinW, Mar 15 '20 at 20:43
Thanks, this really helps! I'm incorporating it into my code but one question I have if I'm grouping by multiple columns does the "map" function has to change? If so what does it have to change to? — MartinW, Mar 16 '20 at 04:35
Thanks, I'll update my code but your answer got me the closest to what I needed. Thanks! — MartinW, Mar 17 '20 at 00:36

Valdi_Bo · Answer 2 · 2020-03-15T20:10:07.837

Define the following function:

def fn(row, grp):
    if row['Senior Flag'] == 'Y':
        return grp[grp['Senior Flag'] == 'N']['Dollar Amt'].sum()
    else:
        return grp[grp.order > row.order]['Dollar Amt'].sum()

Then apply it:

df['result'] = df.groupby('Main ID').apply(
    lambda grp: grp.apply(fn, axis=1, grp=grp))\
    .reset_index(level=0, drop=True)

When you print(df) the just created column (result) should be equal to the desired output.

Note: The solution by ansev works only as long as there are no more that 2 rows with Senior Flag == 'N'.

To confirm it, add another row to your sample data, so that it contains:

   Main ID Sub ID Senior Flag  order  Dollar Amt  desired_output
0        1      A           Y      1          10               8
1        1      B           N      2           5               3
2        1      C           N      3           3               0
3        2    A-1           Y      1          15              16
4        2    A-2           Y      2          10              16
5        2      B           N      3           7               9
6        2      C           N      4           6               3
7        2      D           N      5           3               0

(I corrected desired_output accordingly).

When you run his code, the desired output for Main ID == 2 contains [ 16, 16, 6, 3, 0 ] what is different from the desired output.

So the conclusion is that shift(-1, fill_value=0) is not a proper formula in a general case. You have to compute actual sum of values with higher order, what can be performed only with apply.

I think we should try avoid using apply, https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code — ansev, Mar 15 '20 at 19:26

How to calculate the sum of a subset within a Dataframe based on multiple columns

2 Answers2