
quick background:

  • this relates to pandas rolling, resample, asfreq, and fillna
  • I'm processing time series data, so I want to use pandas offset aliases (e.g., '1T', '5min') as inputs to these methods
  • also, I'm applying a forward-looking window with the rolling method, which is why the dataframe gets reversed in the code below

what I want to achieve:

  • basically, a rolling-window statistics calculation combined with resampling, as shown below. In other words, a 5-minute forward-looking rolling window (e.g., the window for 07:00 covers 07:00-07:05) with the statistics (min, max, etc.) calculated at 1-minute increments.
  • the input data is shown below
ts                          var1
2022-07-12 07:00:00+00:00   NaN
2022-07-12 07:01:53+00:00   1.5
2022-07-12 07:03:17+00:00   2.2
2022-07-12 07:04:02+00:00   0.9
2022-07-12 07:04:59+00:00   3.6
2022-07-12 07:05:00+00:00   NaN
2022-07-12 07:06:22+00:00   3.3
2022-07-12 07:09:46+00:00   2.3
2022-07-12 07:10:00+00:00   NaN
2022-07-12 07:11:22+00:00   1.3
2022-07-12 07:13:44+00:00   4.3
2022-07-12 07:14:26+00:00   4.1
2022-07-12 07:15:00+00:00   NaN
  • what I want
ts                          var1_max    var1_min
2022-07-12 07:00:00+00:00   3.6         0.9
2022-07-12 07:01:00+00:00   3.6         0.9
2022-07-12 07:02:00+00:00   3.6         0.9
2022-07-12 07:03:00+00:00   3.6         0.9
2022-07-12 07:04:00+00:00   3.6         0.9
2022-07-12 07:05:00+00:00   3.3         2.3
2022-07-12 07:06:00+00:00   3.3         2.3
2022-07-12 07:07:00+00:00   2.3         1.3
2022-07-12 07:08:00+00:00   2.3         1.3
2022-07-12 07:09:00+00:00   4.3         1.3
2022-07-12 07:10:00+00:00   4.3         1.3
2022-07-12 07:11:00+00:00   4.3         1.3
2022-07-12 07:12:00+00:00   4.3         4.1
2022-07-12 07:13:00+00:00   4.3         4.1
2022-07-12 07:14:00+00:00   4.1         4.1
2022-07-12 07:15:00+00:00   NaN         NaN
  • so far, I'm getting
ts                          var1_max    var1_min
2022-07-12 07:00:00+00:00   3.6         0.9
2022-07-12 07:01:00+00:00   3.6         0.9
2022-07-12 07:02:00+00:00   NaN         NaN
2022-07-12 07:03:00+00:00   3.6         0.9
2022-07-12 07:04:00+00:00   3.6         0.9
2022-07-12 07:05:00+00:00   3.3         2.3
2022-07-12 07:06:00+00:00   3.3         2.3
2022-07-12 07:07:00+00:00   NaN         NaN
2022-07-12 07:08:00+00:00   NaN         NaN
2022-07-12 07:09:00+00:00   4.3         1.3
2022-07-12 07:10:00+00:00   4.3         1.3
2022-07-12 07:11:00+00:00   4.3         1.3
2022-07-12 07:12:00+00:00   NaN         NaN
2022-07-12 07:13:00+00:00   4.3         4.1
2022-07-12 07:14:00+00:00   4.1         4.1
2022-07-12 07:15:00+00:00   NaN         NaN

key issues:

  • as you can see from my current output, I can't seem to correctly fill the rows that come out as NaN
  • those NaN values should instead be filled by the rolling window calculation over the raw data
  • I tried doing resample (or asfreq) before the rolling window calculation, but that loses valuable information whenever two values fall within the same 1-minute time frame. For example, because the input data has the values 3.6 and 0.9 between 07:04 and 07:05, resampling first will not capture the min (=0.9) and max (=3.6) properly at the end (see the short sketch right after this list).
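
For reference, here is a minimal sketch (assuming the `data_test` frame built in the example code below, before it is reversed) of why a plain asfreq/resample pass loses one of the two readings that fall within the same minute (0.9 at :04:02 and 3.6 at :04:59), whereas resample combined with agg keeps both:

# asfreq keeps only rows that fall exactly on the 1-minute grid,
# so both readings inside the :04 minute are dropped entirely
data_test.asfreq('1min')

# resample with a single reducer keeps only one value per bin
# (here the last one, 3.6), so that bin's min (0.9) is lost
data_test.resample('1min').last()

# resample + agg retains both statistics for the :04 bin
data_test.resample('1min').agg(['min', 'max'])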

example code:

#########################################################################
# input data
import numpy as np
import pandas as pd

data_test = [
        [
            "01:00:00",
            "01:01:53",
            "01:03:17",
            "01:04:02",
            "01:04:59",
            "01:05:00",
            "01:06:22",
            "01:09:46",
            "01:10:00",
            "01:11:22",
            "01:13:44",
            "01:14:26",
            "01:15:00"
        ],
        [np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3, 2.3, np.nan, 1.3, 4.3, 4.1, np.nan]
    ]

data_test = pd.DataFrame(data_test).T
data_test.columns = ['ts', 'var1']
data_test['var1'] = data_test['var1'].astype(float)
data_test['ts'] = pd.to_datetime(data_test.ts)
data_test = data_test.set_index('ts')
    
#########################################################################
# reversing for forward looking window operation
data_test = data_test[::-1]
    
#########################################################################
# adding rolling window statistics: minimum
mins = (
    data_test.rolling(
        window="5min", 
        min_periods=1,
        closed='right'
    ).min().add_suffix("_min")
)

# adding rolling window statistics: maximum
maxs = (
    data_test.rolling(
        window="5min", 
        min_periods=1,
        closed='right'
    ).max().add_suffix("_max")
)
    
#########################################################################
# resampling with even interval
mins = mins.resample(rule="1min").min()
maxs = maxs.resample(rule="1min").max()

data_test = pd.concat([mins, maxs], axis=1)   
    

2 Answers


Given the following, where I've already converted ts to datetime and made it the index:

                           var1
ts
2022-07-12 07:00:00+00:00   NaN
2022-07-12 07:01:53+00:00   1.5
2022-07-12 07:03:17+00:00   2.2
2022-07-12 07:04:02+00:00   0.9
2022-07-12 07:04:59+00:00   3.6
2022-07-12 07:05:00+00:00   NaN
2022-07-12 07:06:22+00:00   3.3
2022-07-12 07:09:46+00:00   2.3
2022-07-12 07:10:00+00:00   NaN
2022-07-12 07:11:22+00:00   1.3
2022-07-12 07:13:44+00:00   4.3
2022-07-12 07:14:26+00:00   4.1
2022-07-12 07:15:00+00:00   NaN

Doing:

# broadcast each 5-minute bin's min/max back onto every row in that bin
df['var1_min'] = df.resample('5min')['var1'].transform('min')
df['var1_max'] = df.resample('5min')['var1'].transform('max')
# snap the index to a regular 1-minute grid, forward-filling values
df = df.asfreq('1min', method='ffill')

Output:

                           var1  var1_min  var1_max
2022-07-12 07:00:00+00:00   NaN       0.9       3.6
2022-07-12 07:01:00+00:00   NaN       0.9       3.6
2022-07-12 07:02:00+00:00   1.5       0.9       3.6
2022-07-12 07:03:00+00:00   1.5       0.9       3.6
2022-07-12 07:04:00+00:00   2.2       0.9       3.6
2022-07-12 07:05:00+00:00   NaN       2.3       3.3
2022-07-12 07:06:00+00:00   NaN       2.3       3.3
2022-07-12 07:07:00+00:00   3.3       2.3       3.3
2022-07-12 07:08:00+00:00   3.3       2.3       3.3
2022-07-12 07:09:00+00:00   3.3       2.3       3.3
2022-07-12 07:10:00+00:00   NaN       1.3       4.3
2022-07-12 07:11:00+00:00   NaN       1.3       4.3
2022-07-12 07:12:00+00:00   1.3       1.3       4.3
2022-07-12 07:13:00+00:00   1.3       1.3       4.3
2022-07-12 07:14:00+00:00   4.3       1.3       4.3
2022-07-12 07:15:00+00:00   NaN       NaN       NaN
  • Did your `var1` column get messed up? – jch Jul 16 '22 at 16:54
  • @jch No, not messed up, that's just what it'd look like after changing the freq to `1min` and forward filling. OP could `.bfill()` that column as well if they'd like. – BeRT2me Jul 16 '22 at 17:42
  • awesome! I haven't had a chance to use `transform`, so it's good to be aware of the usage. BUT, if I'm reading this right, this isn't truly a "rolling" window operation. – geekshock Jul 18 '22 at 15:12
  • If you compare your result with what I want, there is a difference. For example, at 07:07 and 07:08 the var1_max should be 2.3 for both. This is because the rolling window at 07:07 becomes 07:07-07:12, and the corresponding values in the input data for this window are [2.3, NaN, 1.3], so the max of these values is 2.3. Similarly for the rolling window at 07:08: the window is 07:08-07:13, and the values within it are also [2.3, NaN, 1.3]. – geekshock Jul 18 '22 at 15:13
  • And this is exactly where I'm having trouble with the rolling window. So appreciate the attempt! I tried replacing your `resample` with `rolling`, but it looks like I can't do `transform` on `rolling`. I've also been reading/exploring the `agg(['min', 'max'])` approach at the moment.. – geekshock Jul 18 '22 at 15:15
  • Are you actually trying to `interpolate`? – BeRT2me Nov 20 '22 at 23:45

I think I found a solution that generates the output I want, though I'm unsure if this is the most elegant way. In short, it (1) uses the resample and agg methods to resample the data once per statistic (min, max, sum, count), and then (2) applies the rolling method to each resampled statistic to get the rolling window statistics. To calculate the rolling mean/average, I divide the rolling sum by the rolling count. (A possibly more compact variant of the same idea is sketched after the code.)

#########################################################################
# resampling with agg method to retain statistics within time increment
data_resample_min = data_test.resample(rule='1min').agg(['min'])
data_resample_max = data_test.resample(rule='1min').agg(['max'])
data_resample_sum = data_test.resample(rule='1min').agg(['sum'])
data_resample_count = data_test.resample(rule='1min').agg(['count'])

#########################################################################
# reversing dataframe for forward-looking rolling window operation
data_resample_min = data_resample_min[::-1]
data_resample_max = data_resample_max[::-1]
data_resample_sum = data_resample_sum[::-1]
data_resample_count = data_resample_count[::-1]

#########################################################################
# merging multiindex column header
data_resample_min.columns = data_resample_min.columns.map('_'.join)
data_resample_max.columns = data_resample_max.columns.map('_'.join)
data_resample_sum.columns = data_resample_sum.columns.map('_'.join)
data_resample_count.columns = data_resample_count.columns.map('_'.join)
    
#########################################################################   
# adding rolling window statistics: minimum
mins = (
    data_resample_min.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).min()
)
# adding rolling window statistics: maximum
maxs = (
    data_resample_max.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).max()
)
# adding rolling window statistics: sum
sums = (
    data_resample_sum.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).sum()
)
# adding rolling window statistics: count
counts = (
    data_resample_count.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).sum() # summing the per-minute counts gives the rolling count
)
# adding rolling window statistics: mean
means = sums.copy()
means.columns = means.columns.str.replace("_sum","_mean")
# suppress numpy's invalid-value warning for empty bins (0/0)
np.seterr(invalid='ignore')
means.loc[:, :] = sums.values/counts.values

#########################################################################
# creating summary dataframe
data_test = pd.concat([mins, maxs, means], axis=1)  
data_test = data_test[::-1]
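
For what it's worth, here is a possibly more compact sketch of the same idea (assuming the raw `data_test` frame from the question, rebuilt and not yet reversed): a single resample + agg call produces all the per-minute statistics at once, and passing a dict to rolling(...).agg(...) applies the matching reducer to each pre-aggregated column:

# one resample pass keeps min, max, sum and count for every 1-minute bin
stats = data_test.resample('1min').agg(['min', 'max', 'sum', 'count'])
stats.columns = stats.columns.map('_'.join)  # flatten (var1, min) -> var1_min

# reverse for the forward-looking window, as in the code above
stats = stats[::-1]

# apply the matching reducer to each pre-aggregated column
rolled = stats.rolling(window='5min', min_periods=1, closed='right').agg({
    'var1_min': 'min',
    'var1_max': 'max',
    'var1_sum': 'sum',
    'var1_count': 'sum',  # summing per-minute counts gives the rolling count
})
rolled['var1_mean'] = rolled['var1_sum'] / rolled['var1_count']
rolled = rolled[::-1]  # restore chronological order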