I am trying to implement the GAPS operator from Kairos in Python.
The GAPS operator introduces NaN values where there is no data point, based on the sampling frequency. Here is an example on a sample dataset:
import pandas as pd
# Sample DataFrame
data = {'timestamp': ['2023-07-20 00:01:30', '2023-07-20 01:50:10', '2023-07-20 01:40:00', '2023-07-20 03:00:00'],
'value': [10, 20, 30, 15]}
df = pd.DataFrame(data)
# Convert the 'timestamp' column to a pandas DateTimeIndex
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.sort_values('timestamp', inplace=True)
df.set_index('timestamp', inplace=True)
# Create a new DatetimeIndex with 1-hour frequency
# (alternative: derive the bounds from the data instead of hard-coding them)
# start_time = df.index.min().floor('H')  # Round down to the nearest hour
# end_time = df.index.max().ceil('H')  # Round up to the nearest hour
start_time = '2023-07-20 00:00:00'
end_time = '2023-07-20 05:00:00'
new_index = pd.date_range(start=start_time, end=end_time, freq='1H')
# Reindex the DataFrame on the union of the original and hourly indexes; the added timestamps get NaN
df_reindexed = df.reindex(df.index.union(new_index))
print(df_reindexed)
What I got is:
value
2023-07-20 00:00:00 NaN
2023-07-20 00:01:30 10.0
2023-07-20 01:00:00 NaN
2023-07-20 01:40:00 30.0
2023-07-20 01:50:10 20.0
2023-07-20 02:00:00 NaN
2023-07-20 03:00:00 15.0
2023-07-20 04:00:00 NaN
2023-07-20 05:00:00 NaN
What I should get is:
value
2023-07-20 00:00:00 NaN
2023-07-20 00:01:30 10.0
2023-07-20 01:40:00 30.0
2023-07-20 01:50:10 20.0
2023-07-20 02:00:00 NaN
2023-07-20 03:00:00 15.0
2023-07-20 04:00:00 NaN
2023-07-20 05:00:00 NaN
The row '2023-07-20 01:00:00 NaN' should not be present in the result: the sampling frequency is 1 hour, and that hour already contains two data points (2023-07-20 01:40:00 and 2023-07-20 01:50:10).
Any help or reference would be appreciated. Cheers.
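
A minimal sketch of the rule described above, not a confirmed reproduction of Kairos behaviour. It assumes that a grid timestamp t should only be added when the bucket [t, t + freq) contains no data point, that the grid start is aligned to the frequency, and that the first grid timestamp is always kept (this last point is inferred only from the expected output, where 00:00:00 appears even though 00:01:30 falls in the same hour). The function name `insert_gaps` and its parameters are hypothetical.

```python
import pandas as pd


def insert_gaps(df, start, end, freq="h"):
    """Add NaN rows at grid timestamps whose bucket [t, t + freq) is empty.

    Assumptions (not taken from Kairos documentation):
    - `start` is aligned to `freq`, so floor() maps data onto the grid;
    - the first grid timestamp is always kept, as in the expected output.
    """
    grid = pd.date_range(start=start, end=end, freq=freq)
    # Buckets that already contain at least one observation.
    occupied = df.index.floor(freq).unique()
    # Grid timestamps whose bucket holds no observation.
    empty = grid.difference(occupied)
    # Assumed rule: the range start is always emitted.
    gaps = empty.union(grid[:1])
    # New timestamps have no value, so reindex fills them with NaN.
    return df.reindex(df.index.union(gaps))


# Same sample data as in the question.
data = {
    "timestamp": ["2023-07-20 00:01:30", "2023-07-20 01:50:10",
                  "2023-07-20 01:40:00", "2023-07-20 03:00:00"],
    "value": [10, 20, 30, 15],
}
df = pd.DataFrame(data)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values("timestamp").set_index("timestamp")

result = insert_gaps(df, "2023-07-20 00:00:00", "2023-07-20 05:00:00")
print(result)
```

With the sample data, the 01:00:00 row is no longer added because two points fall in that hour, while 02:00:00, 04:00:00 and 05:00:00 are still marked as gaps. If the start of the range should not be forced in, dropping the `union(grid[:1])` step leaves a pure empty-bucket rule.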