Fill Gaps in time series pandas dataframe

Question

I have a pandas dataframe with gaps in time series.
It looks like the following:

Example Input

--------------------------------------
     Timestamp        Close
 2021-02-07 09:30:00  124.624 
 2021-02-07 09:31:00  124.617
 2021-02-07 10:04:00  123.946
 2021-02-07 16:00:00  123.300
 2021-02-09 09:04:00  125.746
 2021-02-09 09:05:00  125.646
 2021-02-09 15:58:00  125.235
 2021-02-09 15:59:00  126.987
 2021-02-09 16:00:00  127.124

Desired Output

--------------------------------------------
     Timestamp        Close
 2021-02-07 09:30:00  124.624 
 2021-02-07 09:31:00  124.617
 2021-02-07 09:32:00  124.617
 2021-02-07 09:33:00  124.617
   'Insert a line for each minute up to the next available
   timestamp with the Close value from the last available timestamp'
 2021-02-07 10:03:00  124.617 
 2021-02-07 10:04:00  123.946
 2021-02-07 16:00:00  123.300
   'I don't want lines inserted here, as this date is not
   present in the original dataset (it could be a non-trading
   day, so I don't want to fill this gap)'
 2021-02-09 09:04:00  125.746
 2021-02-09 09:05:00  125.646
 2021-02-09 15:58:00  125.235
   'Fill the gaps here again, but only between 09:30 and 16:00'
 2021-02-09 15:59:00  126.987
 2021-02-09 16:00:00  127.124

What I have tried is:

# set the index column
df_process.set_index('Exchange DateTime', inplace=True)

# resample and forward fill the gaps
df_process_out = df_process.resample(rule='1T').ffill()

# filter and return only timestamps between 09:30 and 16:00
df_process_out = df_process_out.between_time(start_time='09:30:00', end_time='16:00:00')

However, if I do it like this, it also resamples and generates new timestamps on dates that do not exist in the original dataframe. In the example above, it would also generate minute-by-minute timestamps for 2021-02-08.
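
A minimal reproduction of this behaviour, sketched on the sample above. The column name Timestamp and the variable names are assumptions taken from the example table rather than from the real dataset, and the 'min' alias stands in for 'T'.

```python
import pandas as pd

# Sample rows copied from the example input above
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00", "2021-02-07 10:04:00",
        "2021-02-07 16:00:00", "2021-02-09 09:04:00", "2021-02-09 09:05:00",
        "2021-02-09 15:58:00", "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [124.624, 124.617, 123.946, 123.300, 125.746,
              125.646, 125.235, 126.987, 127.124],
})

# Resample the whole range at once, then keep only 09:30-16:00
out = (df.set_index("Timestamp")
         .resample("1min").ffill()
         .between_time("09:30", "16:00"))

# 2021-02-08 is absent from the input but present in the output
print(sorted(set(out.index.date)))
```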

How can I avoid this?

Furthermore, is there a better way to avoid resampling over the whole time?

df_process_out = df_process.resample(rule='1T').ffill()

This generates timestamps from 00:00 to 24:00 and in the next line of code I have to filter most timestamps out again. Doesn't seem efficient.

Any help/guidance would be highly appreciated
Thanks

Edit:
As requested, a small sample set:

df_in: Input data
df_out_error: Wrong Output Data
df_out_OK: What the output data should look like

I prepared a small sample in the following Colab notebook.

https://colab.research.google.com/drive/1Fps2obTv1YPDpTzXTo7ivLI5njoI-y4n?usp=sharing

Notice that this is only a small subset of the data. I'm trying to clean multiple years of data that are structured like this and have missing minute timestamps.

Kindly create a small reproducible dataframe with a complete expected output dataframe — sammywemmy, Sep 16 '21 at 10:17
Any reason why you don't want rows inserted between `2021-02-07 10:04:00` and `2021-02-07 16:00:00`? Or is that supposed to be filled for each minute too? — Akshay Sehgal, Sep 16 '21 at 13:41
Sorry for being unclear. Yes, this should also be filled with 1-minute (or other interval) timestamps. — Chris Bauer, Sep 16 '21 at 13:53
Please test the code I mention below. That should solve your problem. — Akshay Sehgal, Sep 16 '21 at 14:06
It should solve both of your concerns: resampling for a limited time period, AND applying resample over existing dates only. — Akshay Sehgal, Sep 16 '21 at 14:15

Akshay Sehgal · Accepted Answer · 2021-09-17T01:41:58.763

You can achieve what you need with a combination of df.groupby() (over dates) and resampling using rule = "1Min". Try this -

df_new = (df.assign(date=df.Timestamp.dt.date)   #create new col 'date' from the timestamp
            .set_index('Timestamp')              #set timestamp as index
            .groupby('date')                     #groupby for each date
            .apply(lambda x: x.resample('1Min')  #apply resampling for 1 minute from start time to end time for that date
                   .ffill())                     #ffill values
            .reset_index('date', drop=True)      #drop index 'date' that was created by groupby
            .drop(columns='date', errors='ignore')  #drop 'date' column created before (axis must be given by keyword in pandas >= 2.0; ignored if the column is already absent)
            .reset_index()                       #reset index to get back original 2 cols
         )

df_new
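
For a self-contained check, the sketch below rebuilds the sample from the question and applies the same per-date resampling. It swaps groupby().apply() for an explicit loop with pd.concat, which is only an assumption made here to avoid depending on how a given pandas version treats the grouping column. The names indexed, pieces and df_filled are illustrative.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00", "2021-02-07 10:04:00",
        "2021-02-07 16:00:00", "2021-02-09 09:04:00", "2021-02-09 09:05:00",
        "2021-02-09 15:58:00", "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [124.624, 124.617, 123.946, 123.300, 125.746,
              125.646, 125.235, 126.987, 127.124],
})

indexed = df.set_index("Timestamp")

# Resample each calendar day on its own, so no rows are created for
# days that are missing from the input
pieces = [day.resample("1min").ffill()
          for _, day in indexed.groupby(indexed.index.date)]
df_filled = pd.concat(pieces).reset_index()

print(df_filled["Timestamp"].dt.date.unique())  # only the two input dates
print(len(df_filled))
```

Each day is filled from its own first to its own last timestamp, so the second day here starts at 09:04 rather than 09:30.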

Explanation

1. Resampling for limited time period only

"Furthermore is there a better way to avoid resampling over the whole time. This generates timestamps from 00:00 to 24:00 and in the next line of code I have to filter most timestamps out again. Doesn't seem efficient."

As in the above solution, you can resample and then ffill (or use any other type of fill) with rule = "1Min". When this is applied to one date at a time, the resampling does not run from 00:00 to 24:00 but only from the first to the last timestamp available for that date. To demonstrate, here is the same step applied to a single date in the data -

#filtering for a single day (date derived from Timestamp, since df itself has no 'date' column)
dates = df.Timestamp.dt.date
ddd = df[dates == dates.unique()[0]]

#applying resampling on that given day
ddd.set_index('Timestamp').resample('1Min').ffill()

Notice the start (09:30:00) and end (16:00:00) timestamps for the given date.
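
The bounds can also be checked directly. This is a small sketch on the four rows of 2021-02-07 from the question; the names day and resampled are illustrative only.

```python
import pandas as pd

# The four rows for 2021-02-07 from the question, indexed by timestamp
day = pd.DataFrame(
    {"Close": [124.624, 124.617, 123.946, 123.300]},
    index=pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00",
        "2021-02-07 10:04:00", "2021-02-07 16:00:00",
    ]),
)
day.index.name = "Timestamp"

resampled = day.resample("1min").ffill()

# The new index spans only the first to the last timestamp of the day
print(resampled.index.min())  # 2021-02-07 09:30:00
print(resampled.index.max())  # 2021-02-07 16:00:00
print(len(resampled))         # 391 one-minute rows, both end points included
```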

2. Applying resample over existing dates only

"In the example above it would also generate timestamps on a minute basis for 2021-02-08. How can I avoid this?"

As in the above solution, you can apply the resampling method to each date group separately. In this case, I apply it with a lambda function after separating the date out of the timestamps, so the resampling happens only for the dates that exist in the dataset.

df_new.Timestamp.dt.date.unique()

array([datetime.date(2021, 2, 7), datetime.date(2021, 2, 9)], dtype=object)

Notice that the output only contains the 2 unique dates from the original dataset.
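
If the filled result should additionally be limited to the 09:30 to 16:00 window mentioned in the question, between_time from the question's own attempt can be applied afterwards. The sketch below uses only the 2021-02-09 rows, whose first timestamp is 09:04; the names day, filled and trimmed are illustrative.

```python
import pandas as pd

# Rows for 2021-02-09 from the question; this day starts before 09:30
day = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-09 09:04:00", "2021-02-09 09:05:00", "2021-02-09 15:58:00",
        "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [125.746, 125.646, 125.235, 126.987, 127.124],
})

# Fill the day from its first to its last timestamp
filled = day.set_index("Timestamp").resample("1min").ffill()

# between_time needs a DatetimeIndex and includes both end points by default
trimmed = filled.between_time("09:30", "16:00").reset_index()

print(len(filled), len(trimmed))
```

Note that this drops the original 09:04 and 09:05 rows, which the desired output in the question keeps, so treat this step as optional.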

Thank you, this was a very good explanation! — Chris Bauer, Sep 17 '21 at 09:40
