quick background:
- this can relate to pandas
rolling
,resample
,asfreq
,fillna
- this is based on processing timeseries data so I want to use pandas
offset
(e.g., '1T', '5min', etc.) as an input to methods. - also, I'm applying forward looking window for
rolling
method so there is this reverse of dataframe applied in the code.
what I want to achieve:
- basically rolling window statistics calculation and resampling done in a way shown below. In other words, 5 minute rolling window (e.g., rolling window for 07:00 => 07:00-07:05) with 1 minute increment of statistics (e.g., min, max, etc.) calculation.
- input data shown below
ts var1
2022-07-12 07:00:00+00:00 NaN
2022-07-12 07:01:53+00:00 1.5
2022-07-12 07:03:17+00:00 2.2
2022-07-12 07:04:02+00:00 0.9
2022-07-12 07:04:59+00:00 3.6
2022-07-12 07:05:00+00:00 NaN
2022-07-12 07:06:22+00:00 3.3
2022-07-12 07:09:46+00:00 2.3
2022-07-12 07:10:00+00:00 NaN
2022-07-12 07:11:22+00:00 1.3
2022-07-12 07:13:44+00:00 4.3
2022-07-12 07:14:26+00:00 4.1
2022-07-12 07:15:00+00:00 NaN
- what I want
ts var1_max var1_min
2022-07-12 07:00:00+00:00 3.6 0.9
2022-07-12 07:01:00+00:00 3.6 0.9
2022-07-12 07:02:00+00:00 3.6 0.9
2022-07-12 07:03:00+00:00 3.6 0.9
2022-07-12 07:04:00+00:00 3.6 0.9
2022-07-12 07:05:00+00:00 3.3 2.3
2022-07-12 07:06:00+00:00 3.3 2.3
2022-07-12 07:07:00+00:00 2.3 1.3
2022-07-12 07:08:00+00:00 2.3 1.3
2022-07-12 07:09:00+00:00 4.3 1.3
2022-07-12 07:10:00+00:00 4.3 1.3
2022-07-12 07:11:00+00:00 4.3 1.3
2022-07-12 07:12:00+00:00 4.3 4.1
2022-07-12 07:13:00+00:00 4.3 4.1
2022-07-12 07:14:00+00:00 4.1 4.1
2022-07-12 07:15:00+00:00 NaN NaN
- so far, I'm getting
ts var1_max va1_min
2022-07-12 07:00:00+00:00 3.6 0.9
2022-07-12 07:01:00+00:00 3.6 0.9
2022-07-12 07:02:00+00:00 NaN NaN
2022-07-12 07:03:00+00:00 3.6 0.9
2022-07-12 07:04:00+00:00 3.6 0.9
2022-07-12 07:05:00+00:00 3.3 2.3
2022-07-12 07:06:00+00:00 3.3 2.3
2022-07-12 07:07:00+00:00 NaN NaN
2022-07-12 07:08:00+00:00 NaN NaN
2022-07-12 07:09:00+00:00 4.3 1.3
2022-07-12 07:10:00+00:00 4.3 1.3
2022-07-12 07:11:00+00:00 4.3 1.3
2022-07-12 07:12:00+00:00 NaN NaN
2022-07-12 07:13:00+00:00 4.3 4.1
2022-07-12 07:14:00+00:00 4.1 4.1
2022-07-12 07:15:00+00:00 NaN NaN
key issues:
- as you can see from the data that I'm getting, I can't seem to correctly fill values for those spaces with
NaN
- basically, those
NaN
should be informed by rolling window calculation from the raw data - I tried doing
resample
(orasfreq
) before the rolling window calculation, but that loses some valuable information along the process when I have two values within 1 minute time frame. For example, because the input data has values of 3.6 and 0.9 between 07:04-07:05, doing resampling will not capture min (=0.9) and max (=3.6) properly at the end.
example code:
#########################################################################
# input data
data_test = [
[
"01:00:00",
"01:01:53",
"01:03:17",
"01:04:02",
"01:04:59",
"01:05:00",
"01:06:22",
"01:09:46",
"01:10:00",
"01:11:22",
"01:13:44",
"01:14:26",
"01:15:00"
],
[np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3, 2.3, np.nan, 1.3, 4.3, 4.1, np.nan]
]
data_test = pd.DataFrame(data_test).T
data_test.columns = ['ts', 'var1']
data_test['var1'] = data_test['var1'].astype(float)
data_test['ts'] = pd.to_datetime(data_test.ts)
data_test = data_test.set_index('ts')
#########################################################################
# reversing for forward looking window operation
data_test = data_test[::-1]
#########################################################################
# adding rolling window statistics: minimum
mins = (
data_test.rolling(
window="5min",
min_periods=1,
closed='right'
).min().add_suffix("_min")
)
# adding rolling window statistics: maximum
maxs = (
data_test.rolling(
window="5min",
min_periods=1,
closed='right'
).max().add_suffix("_max")
)
#########################################################################
# resampling with even interval
mins = mins.resample(rule="1min").min()
maxs = maxs.resample(rule="1min").max()
data_test = pd.concat([mins, maxs], axis=1)