How to aggregate row values if other row values are same in DataFrame?

Question

For a DataFrame:

>>> df = DataFrame([['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.10'],['2021-03-31', 'A1019', '990CT', 'MARK', '0.10'],['2021-03-31', 'A0019', '990RT', 'M
ARK', '100'],['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.70'],['2021-03-31', 'A0029', '990CT', 'OFFSET', '1.10'],['2021-03-31', 'A0029', '990CT', 'MARK',
 '9.10'],['2021-03-31', 'A0019', '990QT', 'MARK', '99.10'], ['2021-03-31', 'C0019', '990QT', 'OFFSET', '1'], ['2021-03-31', 'C0019', '990QT', 'GHTC', '5'],
['2021-03-31', 'C0019', '990QT', 'OFFSET', '15']], columns=['DATE','A_ID','R_ID','TYPE','I_VAL'] )
>>> df
     DATE       A_ID   R_ID    TYPE  I_VAL
0  2021-03-31  A0019  990RT  OFFSET   0.10
1  2021-03-31  A1019  990CT    MARK   0.10
2  2021-03-31  A0019  990RT    MARK    100
3  2021-03-31  A0019  990RT  OFFSET   0.70
4  2021-03-31  A0029  990CT  OFFSET   1.10
5  2021-03-31  A0029  990CT    MARK   9.10
6  2021-03-31  A0019  990QT    MARK  99.10
7  2021-03-31  C0019  990QT  OFFSET      1
8  2021-03-31  C0019  990QT    GHTC      5
9  2021-03-31  C0019  990QT  OFFSET     15

Each NON OFFSET (e.g. MARK, GHTC) row uniquely matches to ZERO or more OFFSET rows based on the combination of DATE, A_ID, R_ID. That is, there is a ONE-TO-MANY relation between NON OFFSET (e.g. MARK) to OFFSET rows.

I need to complete an operation in two steps:

Aggregate the values of rows if the value DATE, A_ID, R_ID is same. Put the aggregated value as the value of I_VAL in the NON OFFSET row.
Remove rows with TYPE OFFSET.

The resultant DataFrame is:

# The rows with TYPE OFFSET are removed from resulting df.
# Keeping the OFFSET rows for explaining aggregation
# 0, 1, 2, 3, etc. are the indexes (row number) of the rows

    DATE       A_ID   R_ID    TYPE    I_VAL
0  2021-03-31  A0019  990RT  OFFSET   0.10   
1  2021-03-31  A1019  990CT    MARK   0.10   # no update, condition not met
2  2021-03-31  A0019  990RT    MARK   100.80 # updated with sum of 0, self, and 3
3  2021-03-31  A0019  990RT  OFFSET   0.70
4  2021-03-31  A0029  990CT  OFFSET   1.10
5  2021-03-31  A0029  990CT    MARK   10.20  # updated with sum of own value and 4
6  2021-03-31  A0019  990QT    MARK   99.10  # no update, condition not met
7  2021-03-31  C0019  990QT  OFFSET      1   
8  2021-03-31  C0019  990QT    GHTC      21  # updated with sum of self, 7, and 9
9  2021-03-31  C0019  990QT  OFFSET     15

For step 2, I can do:

filtered_df = df[df.TYPE != 'OFFSET']

But, I couldn't figure out how would I aggregate the values? This post discusses a similar problem, but I couldn't modify it for my requirement.

Anurag Dabas · Accepted Answer · 2021-04-07T09:15:48.520

For your step 1:

Firstly change dtype of 'I_VAL' column from string to float by using astype() method:

df['I_VAL']=df['I_VAL'].astype(float)

Finally You can make use of groupby() method and boolean masking:

df.loc[df['TYPE']!='OFFSET','I_VAL']=df.groupby(['DATE','A_ID','R_ID'],as_index=False,sort=False).transform('sum')[df['TYPE']!='OFFSET']['I_VAL']

Now if you print df you will get your desired output:

#output

      DATE       A_ID   R_ID    TYPE    I_VAL
0   2021-03-31  A0019   990RT   OFFSET  0.1
1   2021-03-31  A1019   990CT   MARK    0.1
2   2021-03-31  A0019   990RT   MARK    100.8
3   2021-03-31  A0019   990RT   OFFSET  0.7
4   2021-03-31  A0029   990CT   OFFSET  1.1
5   2021-03-31  A0029   990CT   MARK    10.2
6   2021-03-31  A0019   990QT   MARK    99.1
7   2021-03-31  C0019   990QT   OFFSET  1.0
8   2021-03-31  C0019   990QT   GHTC    21.0
9   2021-03-31  C0019   990QT   OFFSET  15.0

For your step 2:

Use Boolean Masking:

filtered_df = df[df.TYPE != 'OFFSET']

Now if you print filtered_df you will get your desired output:

#output
      DATE      A_ID    R_ID    TYPE    I_VAL
1   2021-03-31  A1019   990CT   MARK    0.1
2   2021-03-31  A0019   990RT   MARK    100.8
5   2021-03-31  A0029   990CT   MARK    10.2
6   2021-03-31  A0019   990QT   MARK    99.1
8   2021-03-31  C0019   990QT   GHTC    21.0

Hey @Anurag Dabas, Thanks. It helped. I change the `groupby` construct to add multiple columns in the 'sum' logic: `group_by_obj=df.groupby(['DATE','A_ID','R_ID'],as_index=False,sort=False)[['I_VAL', 'P_VAL']].transform('sum')`. Used this *group_by_obj* to perform other operations. — adarsh, Apr 07 '21 at 06:39

How to aggregate row values if other row values are same in DataFrame?

1 Answers1