# Forward-looking rolling mean: for each row, average 'values' over
# [t, t + 1h]. Label slicing on a DatetimeIndex includes both endpoints,
# even when the end label is not actually present in the index.
data['rolling_avg_val'] = 0.0
future_window = '1h'
for i in range(data.shape[0]):
    start_data_idx = data.index[i]
    end_data_idx = start_data_idx + pd.Timedelta(future_window)
    temp_avg = data['values'].loc[start_data_idx:end_data_idx].mean()
    if temp_avg == 0:
        continue
    data.loc[start_data_idx, 'rolling_avg_val'] = temp_avg
The dataframe has about 5,000,000 rows covering two weeks and looks like this, where the index is timestamps with millisecond resolution:
2017-04-12 12:19:04.987 0
2017-04-12 12:19:05.157 1
2017-04-12 12:19:05.297 0
2017-04-12 12:19:05.330 0
2017-04-12 12:19:05.487 0
2017-04-12 12:19:05.530 1
2017-04-12 12:19:05.640 0
2017-04-12 12:19:05.703 1
As you can see, the timestamps are not evenly spaced, and end_data_idx is not necessarily present in the index, but pandas still resolves the proper range when slicing. A common solution I see is to fill in the missing time intervals and then shift the results. However, that makes memory usage explode, so I am trying to avoid it.
On my machine, this processes about 800 rows per second. Is there another way to think about this problem to get a faster solution?
edit:
Expected output:
rolling_avg_val
2017-04-12 12:19:04.987 0.375
2017-04-12 12:19:05.157 0.429
2017-04-12 12:19:05.297 0.333
2017-04-12 12:19:05.330 0.4
2017-04-12 12:19:05.487 0.5
2017-04-12 12:19:05.530 0.667
2017-04-12 12:19:05.640 0.5
2017-04-12 12:19:05.703 1.0
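For reference, here is one vectorized way to think about the problem, sketched on a toy frame mirroring the sample data above: because the index is sorted, each forward window is a contiguous slice, so `searchsorted` can find every window's end position in one call, and a prefix sum turns each window mean into a subtraction and a division. This is a sketch under the assumption of a sorted DatetimeIndex and a `values` column, not the asker's code.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question's layout: millisecond timestamps, 0/1 values.
idx = pd.to_datetime([
    "2017-04-12 12:19:04.987", "2017-04-12 12:19:05.157",
    "2017-04-12 12:19:05.297", "2017-04-12 12:19:05.330",
    "2017-04-12 12:19:05.487", "2017-04-12 12:19:05.530",
    "2017-04-12 12:19:05.640", "2017-04-12 12:19:05.703",
])
data = pd.DataFrame({"values": [0, 1, 0, 0, 0, 1, 0, 1]}, index=idx)

# For each row, the position just past the last row inside [t, t + 1h].
# side="right" matches label slicing, which includes the end timestamp.
ends = data.index.searchsorted(data.index + pd.Timedelta("1h"), side="right")
starts = np.arange(len(data))

# Prefix sums: each window's sum is csum[end] - csum[start].
csum = np.concatenate(([0.0], np.cumsum(data["values"].to_numpy())))
data["rolling_avg_val"] = (csum[ends] - csum[starts]) / (ends - starts)
```

This replaces the per-row Python loop and slicing with a few array operations over the whole frame, so the cost is dominated by one `searchsorted` over the index rather than 5,000,000 label lookups.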