If the time series has a constant frequency:

You could compute the number of 2-minute intervals in 8 days:

window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))

and then take a rolling standard deviation with df['val'].rolling(window=window_size).std():

import pandas as pd
import numpy as np
np.random.seed(1)
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
# the number of 2-minute intervals in 8 days
window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))  # 5760
df['std'] = df['val'].rolling(window=window_size).std()
print(df.tail())
yields
val std
2010-05-20 16:52:00 0.768918 0.291137
2010-05-20 16:54:00 0.486348 0.291098
2010-05-20 16:56:00 0.679610 0.291099
2010-05-20 16:58:00 0.951798 0.291114
2010-05-20 17:00:00 0.059935 0.291109
To resample this time series so that you get one value per day, you could use the
resample method and aggregate the values by taking the mean:

df['std'].resample('D').mean()
yields
...
2010-05-16 0.289019
2010-05-17 0.289988
2010-05-18 0.289713
2010-05-19 0.289269
2010-05-20 0.288890
Freq: D, Name: std, Length: 121
Above, we computed the rolling standard deviation and then resampled to a time
series with daily frequency.
If we were to resample the original data to daily frequency first and then
compute the rolling standard deviation then in general the result would be
different.
Note also that your data looks like it has quite a bit of variation within each
day, so resampling by taking the mean might (wrongly?) hide that variation.
So it is probably better to compute the std first.
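To see concretely that the order of operations matters, here is a small sketch with a synthetic constant-frequency series (shorter than the one above, so it runs quickly) comparing "std first, then resample" against "resample first, then std":

```python
import pandas as pd
import numpy as np

np.random.seed(1)
# a toy 2-minute series spanning roughly two weeks
index = pd.date_range(start='2010-01-20 5:00', periods=10000, freq='2min')
df = pd.DataFrame({'val': np.random.random(len(index))}, index=index)

window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))  # 5760

# std first, then daily resample (the approach above)
std_first = df['val'].rolling(window=window_size, min_periods=1).std().resample('D').mean()

# daily resample first, then an 8-day rolling std of the daily means
resample_first = df['val'].resample('D').mean().rolling(window=8, min_periods=1).std()

# the two orderings disagree substantially: the first measures intra-window
# spread of raw values, the second only the spread of daily averages
diff = (std_first - resample_first).abs().dropna()
print(diff.max())
```

The gap is large here because the raw values vary a lot within each day while the daily means barely vary at all, which is exactly the variation-hiding effect described above.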
If the time series does not have a constant frequency:
If you have enough memory, I think the easiest way to deal with this situation
is to use asfreq to expand the time series to one with a constant frequency.
import pandas as pd
import numpy as np
np.random.seed(1)
# make an example df
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]
# expand the time series, filling in missing values with NaN
df = df.asfreq('2T', method=None)
# now we can use the constant-frequency solution
window_size = int(pd.Timedelta('8D')/pd.Timedelta('2min'))
df['std'] = df['val'].rolling(window=window_size, min_periods=1).std()
result = df['std'].resample('D').mean()
print(result.head())
yields
2010-01-20 0.301834
2010-01-21 0.292505
2010-01-22 0.293897
2010-01-23 0.291018
2010-01-24 0.290444
Freq: D, Name: std, dtype: float64
The alternative to expanding the time series is to write code to compute the
correct sub-Series for each 8-day window. While this is possible, the fact that
you would have to compute this for each row of the time series could make this
method very slow. Thus, I think the faster approach is to expand the time
series.