2
df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/test_df.csv')
                      id    rank      date
1991513 FCWFKZVFAHFK7WP4      32    2021-06-01
1991514 FCWEUHFSM2BSQY2N      33    2021-06-01
1991515 FCWFV6T2GGPM8T2P      34    2021-06-01
1991516 FCWEQ8B4QDJJUNEH      35    2021-06-01
1991517 FCWFAUSPJFGDUBRG      36    2021-06-01

I have the above data for 1 month and I want to create a new column delta_rank_7 which tells me the change in rank in last 7 days for each id (NaNs for 2021-06-01 to 2021-06-07)

I can do something like mentioned here Calculating difference between two rows in Python / Pandas

df.set_index('date').diff(periods=7)

but I have multiple entries for each date and I want to do this for each id.

AmanArora
  • 2,379
  • 6
  • 19
  • 22

1 Answers1

2

If there are duplicated id use:

df = df.set_index('date')
df['delta_rank_7'] = df.groupby('id')['rank'].diff(periods=7)

If need differencies by 7 days use DataFrameGroupBy.shift and subtract:

file = 'https://raw.githubusercontent.com/amanaroratc/hello-world/master/test_df.csv'
df=pd.read_csv(file, parse_dates=['date'])

df = df.sort_values(['id','date'])
df = df.merge((df.set_index(['id','date'])['rank']
                 .sub(df.set_index('date').groupby('id')['rank'].shift(7, freq='d'))
                 .reset_index(name='delta_rank_7'))
               )
print (df)
                     id  rank       date  delta_rank_7
0      CBKFGPBZMG48K5SF     2 2021-06-15           NaN
1      CBKFGPBZMG48K5SF    19 2021-06-19           NaN
2      CBKFGPBZMG48K5SF     2 2021-06-21           NaN
3      CBKFGPBZMG48K5SF     2 2021-06-22           0.0
4      CBKFGPBZMG48K5SF    48 2021-06-24           NaN
                ...   ...        ...           ...
10010  FRNEUJZRVQGT94SP   112 2021-06-23          38.0
10011  FRNEUJZRVQGT94SP   109 2021-06-24          35.0
10012  FRNEUJZRVQGT94SP    68 2021-06-27         -73.0
10013  FRNEUJZRVQGT94SP    85 2021-06-28           NaN
10014  FRNEUJZRVQGT94SP   133 2021-06-30          21.0

[10015 rows x 4 columns]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for this. I still need the rank column in final output though. Also this is just a sample dataframe which is part of a bigger dataframe. I need delta_rank_15, delta_rank_30 etc also. what changes should I make to above code? – AmanArora Sep 29 '21 at 05:59
  • @AmanArora - added to answer. For `delta_rank_15` change `.shift(7, freq='d')` and `.reset_index(name='delta_rank_7')` to `.shift(15, freq='d')` and `.reset_index(name='delta_rank_15')` – jezrael Sep 29 '21 at 06:04
  • I should have given the original used data. Could you test it again on `'https://raw.githubusercontent.com/amanaroratc/hello-world/master/test_df2.csv'`? I am getting `NotImplementedError: Not supported for type Int64Index` – AmanArora Sep 29 '21 at 06:34
  • 1
    @AmanArora - There is some typo in solution, I change `file = 'https://raw.githubusercontent.com/amanaroratc/hello-world/master/test_df.csv'` to `file ='https://raw.githubusercontent.com/amanaroratc/hello-world/master/test_df2.csv'` and working perfectly. – jezrael Sep 29 '21 at 06:36
  • 1
    yes, I found the mistake. I have deleted the file `test_df2.csv` from github because it is private data, you can edit your answer – AmanArora Sep 29 '21 at 06:44