
df1

    trade_date  pct_chg
266 2019-09-20  0.2390
265 2019-09-23  -0.9769
264 2019-09-24  0.2776
263 2019-09-25  -1.0018
262 2019-09-26  -0.8914
261 2019-09-27  0.1182
260 2019-09-30  -0.9201
259 2019-10-08  0.2885
258 2019-10-09  0.3874
257 2019-10-10  0.7814

df2

        obs_date     last_recent_day_pct_chg  last_second_day_pct_chg  avg_l3d_pct_chg  max_l3d_pct_chg  
  a     2019/9/21   
  b     2019/9/22   
  c     2019/9/23   
  d     2019/9/24   
  e     2019/9/25   
  f     2019/9/26   
  g     2019/9/27   
  h     2019/9/28   
  i     2019/9/29   
  j     2019/9/30   
  k     2019/10/5   
  l     2019/10/8   
  m     2019/10/9
  n     2019/10/9
  o     2019/9/29   

Notice: Originally, df2 only has the index and obs_date columns.
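For reference, the two frames can be reconstructed roughly like this (a hypothetical setup; the integer and letter indexes just mirror the display above):

import pandas as pd

df1 = pd.DataFrame({
    'trade_date': ['2019-09-20', '2019-09-23', '2019-09-24', '2019-09-25', '2019-09-26',
                   '2019-09-27', '2019-09-30', '2019-10-08', '2019-10-09', '2019-10-10'],
    'pct_chg': [0.2390, -0.9769, 0.2776, -1.0018, -0.8914,
                0.1182, -0.9201, 0.2885, 0.3874, 0.7814]
}, index=range(266, 256, -1))

df2 = pd.DataFrame({
    'obs_date': ['2019/9/21', '2019/9/22', '2019/9/23', '2019/9/24', '2019/9/25',
                 '2019/9/26', '2019/9/27', '2019/9/28', '2019/9/29', '2019/9/30',
                 '2019/10/5', '2019/10/8', '2019/10/9', '2019/10/9', '2019/9/29']
}, index=list('abcdefghijklmno'))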

Goal

I want to fill the df2 columns with values derived from df1, as follows:

  • last_recent_day_pct_chg: if obs_date can be found in trade_date, return the corresponding pct_chg (e.g. 2019/9/23: -0.9769). If not, return the pct_chg of the most recent preceding trade_date (e.g. 2019/9/21: 0.2390, 2019/10/5: -0.9201).
  • last_second_day_pct_chg: if obs_date can be found in trade_date, return the corresponding last day's pct_chg (e.g. 2019/9/24: 0.2776). If not, return the nearest preceding last_second_day_pct_chg (e.g. 2019/10/5: 0.1182, 2019/9/23: NULL).
  • avg_l3d_pct_chg / max_l3d_pct_chg: based on obs_date, return the average or max pct_chg over the last three trade_dates (e.g. for 2019/10/5 these are 2019/9/26, 9/27 and 9/30; for 2019/10/09 they are 10/09, 10/8 and 9/30).

Try

I considered the shift method and this post, but it doesn't seem to work.

Jack

1 Answer


You can create four series based off df1, using .shift() and resample('1D'), and merge them into your df2. These four series are (a short illustration follows the list):

  1. s0 - A straightforward merge of the data from df1 to df2, which leaves NaN on non-trade days.
  2. s1 - The same merge, but with .ffill(), so each day takes the most recent trade day's value.
  3. s2 - The same merge, but with .shift(1) and .ffill(), so each day gets the second-most-recent trade day's value.
  4. s3 - The same merge, but with .shift(2) and .ffill(), so each day gets the third-most-recent trade day's value.
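To make the mechanics concrete, here is a minimal sketch with a toy series built from the first few rows of df1: resample('1D').mean() expands the trade-date index to calendar days (leaving NaN where there is no trade), ffill() carries the last trade day's value across the gaps, and applying .shift() before resampling looks back one extra trade day.

import pandas as pd

s = pd.Series([0.2390, -0.9769, 0.2776],
              index=pd.to_datetime(['2019-09-20', '2019-09-23', '2019-09-24']))

s.resample('1D').mean().ffill()           # last trade day on or before each calendar day
# 2019-09-20    0.2390
# 2019-09-21    0.2390
# 2019-09-22    0.2390
# 2019-09-23   -0.9769
# 2019-09-24    0.2776

s.shift(1).resample('1D').mean().ffill()  # second-most-recent trade day
# 2019-09-20       NaN
# 2019-09-21       NaN
# 2019-09-22       NaN
# 2019-09-23    0.2390
# 2019-09-24   -0.9769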

From there, you can derive the mean and max, adjust columns 1 and 2 ('last_recent_day_pct_chg' and 'last_second_day_pct_chg', respectively), and rename the columns at the end to keep the code syntactically cleaner, since the column names are long.

NOTE: It was a little confusing what you want the output to be for 'last_recent_day_pct_chg' and 'last_second_day_pct_chg' (I think the expected output in your question might have errors), but you can easily adjust these two rows if the output is incorrect:

df2[1] = np.where(df2[0].notnull(), df2[0], df2[1])
df2[2] = np.where(df2[0].notnull(), df2[0], df2[2])

Full Code:


import pandas as pd
import numpy as np

# Align column names and convert both date columns to datetime.
df1 = df1.rename({'trade_date' : 'obs_date'}, axis=1)
df1['obs_date'], df2['obs_date'] = pd.to_datetime(df1['obs_date']), pd.to_datetime(df2['obs_date'])

# pct_chg series indexed by trade date, plus the four helper series.
s = pd.Series(df1['pct_chg'].tolist(), df1['obs_date'])
s0 = s.resample('1D').mean()                   # NaN on non-trade days
s1 = s.resample('1D').mean().ffill()           # most recent trade day
s2 = s.shift(1).resample('1D').mean().ffill()  # second-most-recent trade day
s3 = s.shift(2).resample('1D').mean().ffill()  # third-most-recent trade day

# Merge on obs_date while keeping df2's original index (handles duplicate dates).
df2 = df2.merge(pd.concat([s0, s1, s2, s3], axis=1), right_index=True, left_on='obs_date')

# Max and mean over the last three trade days (columns 1, 2, 3 = s1, s2, s3);
# take the max before overwriting column 3 with the mean.
df2[4] = df2.iloc[:, 2:5].max(axis=1)
df2[3] = df2.iloc[:, 2:5].mean(axis=1)

# If obs_date is itself a trade day, use that day's value directly.
df2[1] = np.where(df2[0].notnull(), df2[0], df2[1])
df2[2] = np.where(df2[0].notnull(), df2[0], df2[2])

df2 = df2.drop(0, axis=1).rename({1:'last_recent_day_pct_chg', 2:'last_second_day_pct_chg',
                                  3:'avg_l3d_pct_chg', 4:'max_l3d_pct_chg'}, axis=1)
df2

Out[1]: 
    obs_date  last_recent_day_pct_chg  last_second_day_pct_chg  \
a 2019-09-21                   0.2390                      NaN   
b 2019-09-22                   0.2390                      NaN   
c 2019-09-23                  -0.9769                  -0.9769   
d 2019-09-24                   0.2776                   0.2776   
e 2019-09-25                  -1.0018                  -1.0018   
f 2019-09-26                  -0.8914                  -0.8914   
g 2019-09-27                   0.1182                   0.1182   
h 2019-09-28                   0.1182                  -0.8914   
i 2019-09-29                   0.1182                  -0.8914   
o 2019-09-29                   0.1182                  -0.8914   
j 2019-09-30                  -0.9201                  -0.9201   
k 2019-10-05                  -0.9201                   0.1182   
l 2019-10-08                   0.2885                   0.2885   
m 2019-10-09                   0.3874                   0.3874   
n 2019-10-09                   0.3874                   0.3874   

   avg_l3d_pct_chg  max_l3d_pct_chg  
a         0.239000         0.239000  
b         0.239000         0.239000  
c        -0.368950         0.239000  
d        -0.153433         0.277600  
e        -0.567033         0.277600  
f        -0.538533         0.277600  
g        -0.591667         0.118200  
h        -0.591667         0.118200  
i        -0.591667         0.118200  
o        -0.591667         0.118200  
j        -0.564433         0.118200  
k        -0.564433         0.118200  
l        -0.171133         0.288500  
m        -0.081400         0.387400  
n        -0.081400         0.387400  
David Erickson
  • But what if there are duplicated date values in df2 and the result should return the original index? – Jack Nov 01 '20 at 01:37
  • @Jack would df1 have duplicated date values? – David Erickson Nov 01 '20 at 03:15
  • No, but df2 could have duplicated date values – Jack Nov 01 '20 at 03:26
  • @Jack It returns correct results even with duplicate dates, and I've included the original index by adjusting the merge to join on index and column with `df2 = df2.merge(pd.concat([s0,s1,s2,s3], axis=1), right_index=True, left_on='obs_date')` instead of using `reset_index()` and merging column to column. – David Erickson Nov 01 '20 at 19:52
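To illustrate that last point, a minimal hypothetical example (lookup and obs below are stand-ins, not the original frames): merging with left_on='obs_date', right_index=True keeps the observation frame's own index labels, even when the same date appears more than once.

import pandas as pd

lookup = pd.DataFrame({'pct_chg': [0.1182, -0.9201]},
                      index=pd.to_datetime(['2019-09-27', '2019-09-30']))
obs = pd.DataFrame({'obs_date': pd.to_datetime(['2019-09-30', '2019-09-30'])},
                   index=['i', 'o'])

obs.merge(lookup, left_on='obs_date', right_index=True, how='left')
#     obs_date  pct_chg
# i 2019-09-30  -0.9201
# o 2019-09-30  -0.9201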