0

I have a dataframe

import pandas as pd
df = pd.DataFrame({ 
'ID': [1, 1, 1, 2, 2, 2], 
'value': [100, 120, 130, 200, 190, 210],
'value2': [2100, 2120, 2130, 2200, 2190, 2210],   
'state': ['init','mid', 'final', 'init', 'mid', 'final'], 
})

I want to add another row, which is the difference of 'final' and 'init' in the state column for each ID.

The result should look like

import pandas as pd
df = pd.DataFrame({ 
    'ID': [1, 1, 1, 2, 2, 2, 1, 2], 
    'value': [100, 120, 130, 200, 190, 210, 30, 10],
    'value2': [2100, 2120, 2130, 2200, 2190, 2210, 100, 10],   
     'state': ['init','mid', 'final', 'init', 'mid', 'final', 'diff', 'diff'], 
 })

I have tried the following. But I get AttributeError: 'function' object has no attribute 'groupby' error

df1 = df.copy()

df1 = df[df.state.isin(['final', 'init'])].copy
s = df1.groupby('ID', sort=False).diff().dropna.assign(id=df1['ID'].unique(), state='diff')
df = df.append(s, sort=True).sort_values("ID")
postcolonialist
  • 449
  • 7
  • 17

2 Answers2

1

What about something like this? Group by ID, keep only the value* cols, take the diff, and reassign ID and state:

df
ID value value2 state
0 1 100 2100 init
1 1 120 2120 mid
2 1 130 2130 final
3 2 200 2200 init
4 2 190 2190 mid
5 2 210 2210 final
diffs = (df[df.state.isin(['init', 'final'])]
    .groupby('ID')[['value', 'value2']]
    .diff().dropna()
    .assign(ID=df.ID.unique(), state='diff')
)
ID value value2 state
0 1 30.0 30.0 diff
1 2 10.0 10.0 diff

Then combine back into df:

df = df.append(diffs)
ID value value2 state
0 1 100.0 2100.0 init
1 1 120.0 2120.0 mid
2 1 130.0 2130.0 final
3 2 200.0 2200.0 init
4 2 190.0 2190.0 mid
5 2 210.0 2210.0 final
0 1 30.0 30.0 diff
1 2 10.0 10.0 diff
tdy
  • 36,675
  • 19
  • 86
  • 83
1

Another option is to take the difference betweent the final state and the init as a matrix operation and then fill the missing columns ID and state passing a dictionary of values.

val_cols = ['value', 'value2']
df_final = df.query('state == "final"')[val_cols].reset_index(drop=True) - df.query('state== "init"')[val_cols].reset_index(drop=True)

df_final = pd.concat([df, df_final])

values = {'ID': {i:i+1 for i, v in enumerate(df.ID.unique())}, 'state': 'diff'}
df_final.fillna(values, inplace=True)

Final Result

enter image description here

Miguel Trejo
  • 5,913
  • 5
  • 24
  • 49
  • This solution works well in this example. But if the IDs were not non numeric or were not sequentially arranged, values = {'ID': {i:i+1 for i, v in enumerate(df.ID.unique())}, 'state': 'diff'} does not give the right output for ID / group. How might one fix that? – postcolonialist Mar 14 '21 at 19:49
  • 1
    yes @postcolonialist, in that case sorting `df_final` by id and state before applying the operations should work – Miguel Trejo Mar 16 '21 at 14:19