0

I have a series of timestamps called dates that look as such:

1   2021-04-21 09:34:00+00:00
2   2021-04-21 10:30:02+00:00
3   2021-04-21 15:54:00+00:00
4   2021-04-22 18:33:57+00:00
5   2021-04-23 18:48:04+00:00

I am trying to find the closest date from another series called PublishTime which has minutely data for a 6 day time period covering some of the timestamps. The date must be after the timestamp and cannot be before. An example looks as such:

0   2021-04-21 09:30:00
1   2021-04-21 09:31:00
2   2021-04-21 09:32:00
3   2021-04-21 09:33:00
4   2021-04-21 09:34:00

Is there an easy way to quickly find the closest date? I have looked in the datetime module but cannot find an answer.

EDIT I incorrectly said that the date column covers all the timestamps. In the second series that I am trying to match it to, there is no minute data for weekends and non-business hours, therefore I would like to find the closest date AFTER the timestamp, not before.

gerardcslabs
  • 213
  • 1
  • 3
  • 9

1 Answers1

1

Thanks @Quang Hoang, merge_asof worked. Since it was new to me as well, I tried it out and here's the result.

First get the df from the question and reformat type to match the type in "PublishTime" series

df = pd.DataFrame({'dates': ["2021-04-21 09:34:00+00:00", "2021-04-21 10:30:02+00:00", "2021-04-21 15:54:00+00:00", "2021-04-22 18:33:57+00:00", "2021-04-23 18:48:04+00:00"]})
df['dates'] = pd.to_datetime(df['dates'])
df['dates'] = df['dates'].dt.strftime('%Y-%m-%d %H:%M:%S')
df['dates'] = pd.to_datetime(df['dates'])
df

dates
0   2021-04-21 09:34:00
1   2021-04-21 10:30:02
2   2021-04-21 15:54:00
3   2021-04-22 18:33:57
4   2021-04-23 18:48:04

Get the df in PublishTime series

df2 = pd.DataFrame({'PublishTime': ["2021-04-21 09:33:00", "2021-04-21 09:34:00", "2021-04-21 09:35:00", "2021-04-21 10:31:00", "2021-04-21 15:56:00", "2021-04-25 15:56:00", "2021-04-26 15:56:00"]})
df2['PublishTime'] = pd.to_datetime(df2['PublishTime'])
df2

PublishTime
0   2021-04-21 09:33:00
1   2021-04-21 09:34:00
2   2021-04-21 09:35:00
3   2021-04-21 10:31:00
4   2021-04-21 15:56:00
5   2021-04-25 15:56:00
6   2021-04-26 15:56:00

Finally, merge_asof and use forward as the direction.

pd.merge_asof(df, df2, left_on='dates', right_on='PublishTime', direction='forward')

     dates                  PublishTime
0   2021-04-21 09:34:00 2021-04-21 09:34:00
1   2021-04-21 10:30:02 2021-04-21 10:31:00
2   2021-04-21 15:54:00 2021-04-21 15:56:00
3   2021-04-22 18:33:57 2021-04-25 15:56:00
4   2021-04-23 18:48:04 2021-04-25 15:56:00

As you can see, in the PublishTime series I didn't add the data for 22nd - 24th April to show that some data can be missing (like weekends) and then it took the next closest one on 25th.

Shubham Periwal
  • 2,198
  • 2
  • 8
  • 26