How to resample a dataframe ONLY when time range is too long?

Question

I have a simple DataFrame like this:

timestamp	Power
29/08/2021 02:30:16	155
29/08/2021 02:45:19	151
29/08/2021 03:00:14	155
29/08/2021 03:30:12	152
29/08/2021 04:00:12	149
29/08/2021 04:15:09	152
29/08/2021 04:30:16	153
29/08/2021 04:45:09	211
29/08/2021 05:30:19	77

So these data should be measured every 15 minutes, but for some reason some measurements have been skipped. I want to add the missing timestamps followed by a "NaN" when the measurement is skipped. I know that this can be done by the function "resample" but it's important to use it only when needed. So what I need is to add a condition to that function: I want to resample only between those rows that are (for example) more than 16 minutes distant from each other. In this way, when I don't need to resample, the timestamps will still be the original ones, and this is very important for my work. So what I want to obtain is, more or less:

timestamp	Power
29/08/2021 02:30:16	155
29/08/2021 02:45:19	151
29/08/2021 03:00:14	155
29/08/2021 03:15:00	NaN
29/08/2021 03:30:12	152
29/08/2021 03:45:00	NaN
29/08/2021 04:00:12	149
29/08/2021 04:15:09	152
29/08/2021 04:30:16	153
29/08/2021 04:45:09	211
29/08/2021 05:00:00	NaN
29/08/2021 05:15:00	NaN
29/08/2021 05:30:19	77

Antton Lamarca · Answer 1 · 2022-02-24T16:03:38.887

Okay, so this was trickier than I expected, but I think I figured it out. Here's my solution:

I created a toy example of your df (please provide this code yourself next time, like stated here for example)

import pandas as pd
import datetime

df = pd.DataFrame()
df['timestamp'] = ['29/08/2021 02:30:16', '29/08/2021 02:45:19', '29/08/2021 03:00:14', '29/08/2021 03:30:12']
df['Power'] = [155,151,155,152]

The df looks like this:

   timestamp              Power
0  29/08/2021 02:30:16    155
1  29/08/2021 02:45:19    151
2  29/08/2021 03:00:14    155
3  29/08/2021 03:30:12    152

First we convert the timestamp column into a pandas datetime object, and then replace the axis of the dataframe with it.

df.timestamp = pd.to_datetime(df.timestamp)
df.set_index('timestamp', inplace=True)

This allows us to use resample on it, but as you have already noticed this will create a completely new range of dates instead of incorporating your own. The way I went around this was by using resample only for each pair of contiguous timestamps. This way it will only add new entries if there is "space" for it between the timestamps.

final_df = pd.DataFrame()
timestamp_list = []
power_list = []
for i, timestamp in enumerate(df.index.to_list()):
    temp_df = df[i:i+2].resample('16Min', origin='start').asfreq()
    timestamp_list.extend(temp_df.index.to_list())
    power_list.extend(temp_df.Power.to_list())
final_df['timestamp'] = timestamp_list
final_df['Power'] = power_list

The result looks like this:

  timestamp            Power
0 2021-08-29 02:30:16  155.0
1 2021-08-29 02:45:19  151.0
2 2021-08-29 03:00:14  155.0
3 2021-08-29 03:15:14    NaN
4 2021-08-29 03:30:12  152.0

If you want to re-format the date to the exact same format you had before, I recommend looking into the datetime package. Or you can just do it manually by iterating through the column.

This does not yield rounded time stamps for the empty fields and on top of that, the for loop is very inefficient because of the concat which is done as many times as you have time stamps. Concat basically creates a whole new data frame in memory. So I would not recommend it in this way — Eelco van Vliet, Feb 24 '22 at 15:40
Thanks @Antton, sorry for how I've written my question, it's my first time :) Anyway there's a little problem in your solution: basically now we have two measurements very close to each other (02:45:16 and 02:45:19), so it's like a superposition where the first one is "NaN (added from us) and the second one is the real value (151). Among these two, I would need to keep only the second one. This situation happens not only because of your code, but mainly because it already appears in the original DataFrame: sometimes timestamps are too close, and I want to keep only the last one. How to do it? — Lorenzo Siboni, Feb 24 '22 at 15:51
Very good points. The call to 'concat' can be avoided entirely by just filling two lists with the values instead. The overlap of timestamps can be avoided by resampling with 16 minutes instead of 15, which I should have done from the beginning. I've edited the answer with these changes. — Antton Lamarca, Feb 24 '22 at 16:03

Eelco van Vliet · Accepted Answer · 2022-02-24T17:34:28.260

To reproduce your data I have done:

import pandas as pd
data = pd.DataFrame.from_records(
    [
        ["29/08/2021 02:30:16", 155],
        ["29/08/2021 02:45:19", 151],
        ["29/08/2021 02:47:19", 152],
        ["29/08/2021 03:00:14", 155],
        ["29/08/2021 03:30:12", 152],
        ["29/08/2021 04:00:12", 149],
        ["29/08/2021 04:15:09", 152],
        ["29/08/2021 04:30:16", 153],
        ["29/08/2021 04:45:09", 211],
        ["29/08/2021 05:30:19", 77]
    ],
    columns=["timestamp", "Power"],
)
data["timestamp"] = pd.to_datetime(data["timestamp"])

In order to fill the gaps I have done the following steps.

First, create a new column with your rounded times stamps:

data["t_rounded"] = data["timestamp"].dt.round("15min")
data.set_index("t_rounded", inplace=True, drop=True)

Make sure there are no duplicated indices by dropping all duplicates and keeping only the first sample:

# drop any duplicated samples which occurred too close
is_duplicate = data.index.duplicated(keep='last')
# keep the duplicates which we are going to remove
duplicates_df = data[is_duplicate]

# remove the duplicates from the original data
data = data[~is_duplicate]

Then, create a new desired equidistant index:

new_index = pd.period_range(data.index.values[0], data.index.values[-1], freq="15min")
new_index = new_index.to_timestamp()

Now use the new index for your data frame:

data = data.reindex(new_index)
data.reset_index(inplace=True)

Next, impose the rounded timestamps (now called index because of the reset_index) to empty times

mask = data["timestamp"].isna()
data.loc[mask, "timestamp"] = data.loc[mask, "index"]

Finally, set the new filled timestamps as index and drop the rounded time column

data.set_index("timestamp", inplace=True, drop=True)
data.drop("index", inplace=True, axis=1)

In case needed, you can add the duplicated time stamps which we removed earlier by doing:

df = duplicates_df.reset_index().set_index("timestamp", drop=True).drop("t_rounded", axis=1)
data = pd.concat([data, df]).sort_index()

The final result looks like

                     Power
timestamp                 
2021-08-29 02:30:16  155.0
2021-08-29 02:45:19  151.0
2021-08-29 02:47:19  152.0
2021-08-29 03:00:14  155.0
2021-08-29 03:15:00    NaN
2021-08-29 03:30:12  152.0
2021-08-29 03:45:00    NaN
2021-08-29 04:00:12  149.0
2021-08-29 04:15:09  152.0
2021-08-29 04:30:16  153.0
2021-08-29 04:45:09  211.0
2021-08-29 05:00:00    NaN
2021-08-29 05:15:00    NaN
2021-08-29 05:30:19   77.0

Thanks @Eelco, your code perfectly works in this case. But I have a problem: in the same Dataframe it's also possible that two timestamps are too close to each other (maybe few minutes or few seconds distant). So, when I tried your code on another portion of the DataFrame where this happens, the code "dt.round" will round up more than one timestamp to the same timing. So the function "reindex", in this case, will give the error "cannot reindex from a duplicate axis". How can i solve this issue? — Lorenzo Siboni, Feb 24 '22 at 16:07
This happens because in that case, the round method yields the same time value for two subsequent samples. Duplicated indices are not allowed with reindex. The most straightforward way it just to drop one of the two samples. I will update the example to show that. — Eelco van Vliet, Feb 24 '22 at 16:24
Just one last thing @Eelco, when you say "Make sure there are no duplicated indices by dropping all duplicates and keeping only the first sample", is there a way to do exactly that thing but keeping the LAST sample instead? How would that line of the code change? Thank you so much in advance for your help — Lorenzo Siboni, Feb 24 '22 at 16:51
No problem, just add keep='last' to the duplicated method. In case you add the removed duplicated samples at the end, the final result is the same — Eelco van Vliet, Feb 24 '22 at 17:36
You have been so gentle @Eelco, thank you sincerely for your patience — Lorenzo Siboni, Feb 24 '22 at 17:53

How to resample a dataframe ONLY when time range is too long?

2 Answers2