0

I'm trying to find the nearest datetime index of my table. I'm using this post as a starting point, and am using this MWE:

import os
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta

df = pd.DataFrame() 
df['datetime'] = pd.date_range(start='2019-01-01', end='2021-01-01', freq='H')
df = df.set_index('datetime')

df['year'] = pd.DatetimeIndex(df.index).year
df['mnth'] = pd.DatetimeIndex(df.index).month
df['day'] = pd.DatetimeIndex(df.index).day
df['dow'] = pd.DatetimeIndex(df.index).dayofweek # Mon=0, ..., Sun=6
df['hour'] = pd.DatetimeIndex(df.index).hour

years = df.year.unique()

idxlist = []

for y in years:
    idx1 = df.loc[((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2))]
    #idx1 = df.iloc[df.get_loc(((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2)), method='nearest')]
    idxlist.append(idx1)

Edit based on Michael Delgado comments:

I have several years' worth of daily data, including for the correct days (first Sunday of April in every year).

Even though this works with my MWE, my actual dataset contains missing data and there may not be data for exactly 2am. Data is spaced roughly 20-35min intervals, so the closest value should be less than 15min away from the 2AM target.

I want to find the nearest datetime to 2am in the first Sunday in April. This is for every year in the DataFrame, but I'm not sure how to do this.

Medulla Oblongata
  • 3,771
  • 8
  • 36
  • 75
  • 1
    I would suggest creating a column subtracting the 2am timestamp from the normal timestamp, sorting by absolute value, and taking the max – Ian Wright Aug 02 '21 at 00:31
  • Do you want the date which is closest to 2 AM on the first sunday in April in any year? Or just April 5, 2020? – Michael Delgado Aug 02 '21 at 00:40
  • @Michael Delgado . in every year in the `df`, thanks – Medulla Oblongata Aug 02 '21 at 00:50
  • Got it. So we should ignore the part of your example that sets the year as `max(df.year)`? If so, can you clarify your goal a bit more in the question? – Michael Delgado Aug 02 '21 at 00:57
  • does the data always have an observation for the correct day, it just might not be exactly at 2 AM? or could the closest value be +/- several days? or several months? – Michael Delgado Aug 02 '21 at 04:41
  • I definitely have data for the correct days (first Sunday of April in every year). I have data for every day at roughly 20-35min intervals, so the closest value should be less than 15min away from the 2AM target. – Medulla Oblongata Aug 02 '21 at 18:57
  • oh! that definitely changes things. can you update the question? – Michael Delgado Aug 03 '21 at 22:47

2 Answers2

1

This is a bit of a challenge, just because "the first sunday of April in any year" takes a few steps to compute. You could approach this in a few ways, but I'll approach this by first computing the nearest target date in the year of the target date, as well as the following year (because April is always in the first part of the year, so the date will never be closer to the April in the preceding year), then finding the minimum absolute difference to either of the targets.

First step, I'll expand on your MWE with some random offsets (+/- 30 mins) and a longer time series. I also added a value column so the df displays as a frame:

In [26]: df = pd.DataFrame(
    ...:     {'val': np.arange(24*366*10)},
    ...:     index=(
    ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
    ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
    ...:     ),
    ...: )

In [27]: df
Out[27]:
                       val
2010-01-01 00:29:00      0
2010-01-01 01:09:00      1
2010-01-01 01:43:00      2
2010-01-01 03:14:00      3
2010-01-01 03:54:00      4
...                    ...
2020-01-08 18:31:00  87835
2020-01-08 20:21:00  87836
2020-01-08 20:54:00  87837
2020-01-08 21:47:00  87838
2020-01-08 23:11:00  87839

Next, I find the date of the first Sunday in April (at 2 AM) for the year of each row:

In [28]: apr1 = pd.to_datetime({'year': df.index.year, 'month': 4, 'day': 1, 'hour': 2})

In [29]: apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday, unit='day')

In [30]: apr_first_sun
Out[30]:
0       2010-04-04 02:00:00
1       2010-04-04 02:00:00
2       2010-04-04 02:00:00
3       2010-04-04 02:00:00
4       2010-04-04 02:00:00
                ...
87835   2020-04-05 02:00:00
87836   2020-04-05 02:00:00
87837   2020-04-05 02:00:00
87838   2020-04-05 02:00:00
87839   2020-04-05 02:00:00
Length: 87840, dtype: datetime64[ns]

In [31]: apr1 = pd.to_datetime({'year': df.index.year + 1, 'month': 4, 'day': 1, 'hour': 2})

In [32]: next_apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday, unit='day')

Next, find the closer absolute difference:

In [36]: nearer_abs_diff = np.minimum(abs(df.index - apr_first_sun.values), abs(df.index - next_apr_first_sun.values))

In [37]: nearer_abs_diff
Out[37]:
TimedeltaIndex(['93 days 01:31:00', '93 days 00:51:00', '93 days 00:17:00',
                '92 days 22:46:00', '92 days 22:06:00', '92 days 20:54:00',
                '92 days 20:23:00', '92 days 19:25:00', '92 days 18:12:00',
                '92 days 16:48:00',
                ...
                '87 days 12:19:00', '87 days 11:12:00', '87 days 09:36:00',
                '87 days 08:31:00', '87 days 07:36:00', '87 days 07:29:00',
                '87 days 05:39:00', '87 days 05:06:00', '87 days 04:13:00',
                '87 days 02:49:00'],
               dtype='timedelta64[ns]', length=87840, freq=None

Finally, find the positional index of the minimum absolute difference and use that to index into the dataframe:

In [38]: idx = np.argmin(nearer_abs_diff)

In [39]: df.iloc[idx]
Out[39]:
val    37346
Name: 2014-04-06 02:14:00, dtype: int64
Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
  • Thanks for the detailed answer... it's not that challenging to find the first Sunday in April though - it's always between 1 and 7 April (hence why I wrote `df.day<=7`). So is there a way to simplify your code? – Medulla Oblongata Aug 02 '21 at 01:29
  • I'm finding the exact timestamp of each year's first sunday in April, then computing the difference relative to that value. So I need to calculate when it is, not just filter the values for matches. – Michael Delgado Aug 02 '21 at 01:34
  • I'm not sure I understand your answer. `df.iloc[idx]` returns one value, but I want it to give the index for every April between the first and last years. – Medulla Oblongata Aug 02 '21 at 02:12
  • Oh! I thought you wanted the one that comes the closest of any value across all years. In this case, you could groupby year and find the argmin in each year. – Michael Delgado Aug 02 '21 at 02:43
1

Based on your comments, it seems that you can rely on always having data within an hour of your desired time (1st Sunday of April) in each year. In this case, you can take a simpler approach.

Using an example dataset with variation in the times:

In [4]: df = pd.DataFrame(
   ...:     ...:     {'val': np.arange(24*366*10)},
   ...:     ...:     index=(
   ...:     ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
   ...:     ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
   ...:     ...:     ),
   ...:     ...: )

In [5]: df
Out[5]:
                       val
2010-01-01 00:14:00      0
2010-01-01 01:20:00      1
2010-01-01 01:46:00      2
2010-01-01 03:20:00      3
2010-01-01 03:51:00      4
...                    ...
2020-01-08 18:48:00  87835
2020-01-08 19:46:00  87836
2020-01-08 21:07:00  87837
2020-01-08 22:06:00  87838
2020-01-08 23:11:00  87839

[87840 rows x 1 columns]

We can filter based on times rounded to the nearest 2 hours:

within_an_hour = df[
    (df.index.month==4)
    & (df.index.day<=7)
    & (df.index.day_of_week == 6)
    & (df.index.round('2H').hour == 2)
]

We can then select the closest indices by taking the minimum absolute difference to the 2-hour rounded value for each year:

In [15]: closest_indices = (
    ...:     within_an_hour
    ...:     .groupby(within_an_hour.index.year)
    ...:     .apply(
    ...:         lambda x: x.index.values[np.argmin(abs(x.index - x.index.round('2H')))]
    ...:     )
    ...: )

In [16]: closest_indices
Out[16]:
2010   2010-04-04 02:17:00
2011   2011-04-03 02:22:00
2012   2012-04-01 01:49:00
2013   2013-04-07 01:39:00
2014   2014-04-06 02:01:00
2015   2015-04-05 01:58:00
2016   2016-04-03 02:12:00
2017   2017-04-02 01:54:00
2018   2018-04-01 02:22:00
2019   2019-04-07 02:13:00
dtype: datetime64[ns]
Michael Delgado
  • 13,789
  • 3
  • 29
  • 54