2

I have a question about dealing with the rolling standard deviation:

The data frame looks like this:

2010-01-20 05:00:00   -0.011
2010-01-20 05:02:00   -0.032
2010-01-20 05:02:00   -0.037
2010-01-20 05:04:00    0.001
2010-01-20 05:06:00    0.023
2010-01-20 05:06:00    0.011
2010-01-20 05:08:00    0.049
2010-01-20 05:10:00    0.102
....
2010-05-20 17:00:00    0.022

This is 2-min data from 5am to 5pm (The format of index 'yyyy-mm-dd hh:mm:ss' is datestamp)

I want to calculate the 8-day look-back on the standard deviation. My intuition is to split the data frame into daily data set and then calculate the rolling standard deviation, but I don't know how to deal with these indexand i guess my methods may takes a lot of time to calculate. Thanks a lot for your help!

Finally, I would like the result like this:

2010-01-20   0.0
2010-01-21   0.0
2010-01-22   0.0
....
2010-01-26   0.0
2010-01-27   0.12
2010-01-28   0.02
2010-01-29   0.07
...
2010-05-20   0.10

Thank you for your help. @unutbu

Just found the problem in the data: The data frame is not completely including the whole 2-min data. For example:

2010-01-21 15:08:00    0.044
2010-01-22 05:10:00    0.102

The data ends at 15:08 on 2010-01-21 and start at 05:10:00 on 2010-01-22. so setting window size with a constant may not fixed this problem. Any suggestions? thanks a lot

Julia
  • 111
  • 1
  • 9

1 Answers1

8

If the time series has a constant frequency:

You could compute the number of 2 second interals in 8 days:

window_size = pd.Timedelta('8D')/pd.Timedelta('2min')

and then use pd.rolling_std with window=window_size:

import pandas as pd
import numpy as np
np.random.seed(1)

index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
# the number of 2 second intervals in 8 days
window_size = pd.Timedelta('8D')/pd.Timedelta('2min')    # 5760.0

df['std'] = pd.rolling_std(df['val'], window=window_size)
print(df.tail())

yields

                          val       std
2010-05-20 16:52:00  0.768918  0.291137
2010-05-20 16:54:00  0.486348  0.291098
2010-05-20 16:56:00  0.679610  0.291099
2010-05-20 16:58:00  0.951798  0.291114
2010-05-20 17:00:00  0.059935  0.291109

To resample this time series so as to get one value per day, you could use the resample method and aggregate the values by taking the mean:

df['std'].resample('D', how='mean')

yields

...
2010-05-16    0.289019
2010-05-17    0.289988
2010-05-18    0.289713
2010-05-19    0.289269
2010-05-20    0.288890
Freq: D, Name: std, Length: 121

Above, we computed the rolling standard deviation and then resampled to a time series with daily frequency.

If we were to resample the original data to daily frequency first and then compute the rolling standard deviation then in general the result would be different.

Note also that your data looks like it has quite a bit of variation within each day, so resampling by taking the mean might (wrongly?) hide that variation. So it is probably better to compute the std first.


If the time series does not have a constant frequency:

If you have enough memory, I think the easiest way to deal with this situation is to use asfreq to expand the time series to one that has a constant frequency.

import pandas as pd
import numpy as np
np.random.seed(1)

# make an example df
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]

# expand the time series, filling in missing values with NaN
df = df.asfreq('2T', method=None)

# now we can use the constant-frequency solution
window_size = pd.Timedelta('8D')/pd.Timedelta('2min')    
df['std'] = pd.rolling_std(df['val'], window=window_size, min_periods=1)

result = df['std'].resample('D', how='mean')
print(result.head())

yields

2010-01-20    0.301834
2010-01-21    0.292505
2010-01-22    0.293897
2010-01-23    0.291018
2010-01-24    0.290444
Freq: D, Name: std, dtype: float64

The alternative to expanding the time series is to write code to compute the correct sub-Series for each 8-day window. While this is possible, the fact that you would have to compute this for each row of the time series could make this method very slow. Thus, I think the faster approach is to expand the time series.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • thank you for the comments. I think I got your idea. – Julia Mar 17 '15 at 21:05
  • Hi unutbu, I found the data doesn't include the whole 2-min data set. who should i set the dynamic window size or i should change another way to calcite this rolling standard deviation (I have updated my questions) do you have any suggestion for the new problem? thx. – Julia Mar 17 '15 at 21:57