2

So I know how to create a new column based on the difference between consecutive columns, here. But I want to do this for multiple subsets of a dataframe, i.e, for a .groupby('zip')

     zip   year   val
0  48123   2013    10
1  48123   2014    11
2  48123   2015    11
3  60122   2013    13
4  60122   2014    10
5  60122   2015    10

Would produce

     zip   year   val   dVal
0  48123   2013    10      1
1  48123   2014    11      0 
2  48123   2015    11    NaN
3  60122   2013    13     -3
4  60122   2014    10      1
5  60122   2015    11    NaN

Right now I'm looping over my dataframe using

for index, group in df.groupby('zip'):
    group.loc['dVal'] =  group['val'].shift(-1) - group['val']

But I have over a million groups, so it's taking a while, is there a better way to be doing this?

Community
  • 1
  • 1
SharpObject
  • 597
  • 5
  • 17

1 Answers1

4

You can use sub for substraction and remove loop:

df['dVal'] = df.groupby('zip')['val'].shift(-1).sub(df['val'])
print (df)
     zip  year  val  dVal
0  48123  2013   10   1.0
1  48123  2014   11   0.0
2  48123  2015   11   NaN
3  60122  2013   13  -3.0
4  60122  2014   10   0.0
5  60122  2015   10   NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252