5
data['rolling_avg_val'] = 0
future_window = '1h'
for i in range(data.shape[0]):
        start_data_idx = data.index[i]
        end_data_idx = start_data_idx + pd.Timedelta(future_window)
        temp_avg = data['values'][start_data_idx:end_data_idx].mean()
        if temp_avg == 0:
            continue
        data.loc[start_data_idx,'rolling_avg_val'] = temp_avg

The dataframe is about 5,000,000 rows representing data over the course of two weeks and looks like this, where the index is timestamps in milliseconds:

2017-04-12 12:19:04.987   0
2017-04-12 12:19:05.157   1
2017-04-12 12:19:05.297   0
2017-01-12 12:19:05.330   0
2017-04-12 12:19:05.487   0
2017-04-12 12:19:05.530   1
2017-04-12 12:19:05.640   0
2017-04-12 12:19:05.703   1

As you can see, the timestamps are not evenly spaced and the end_data_idx is not necessarily in the dataframe, but pandas is able to get the proper ranges when slicing. A common solution that I see is to fill in the missing time intervals and then shift the results. However, this makes the memory explode, so I am trying to avoid that.

On my machine, it looks like it'll do about 800 rows per second. Is there another way to think about this problem to get a faster solution?

edit:

Expected output:

                          rolling_avg
2017-04-12 12:19:04.987   0.375
2017-04-12 12:19:05.157   0.429
2017-04-12 12:19:05.297   0.333
2017-01-12 12:19:05.330   0.4
2017-04-12 12:19:05.487   0.5
2017-04-12 12:19:05.530   0.667
2017-04-12 12:19:05.640   0.5
2017-04-12 12:19:05.703   1.0
user4446237
  • 636
  • 8
  • 21

1 Answers1

7

reverse df

flip index sign (from monotone decreasing to increasing

rolling.mean()

flip index sign again

reverse df again

df2 = df[::-1]
df2.index = pd.datetime(2050,1,1) - df2.index
df2 = df2.rolling('1H').mean()
df3 = df2[::-1]
df3.index = df.index
James Kang
  • 389
  • 1
  • 3
  • 8
  • @James Kang, for the line of code df2 = df2.rolling('1H').mean(), can i update the part ".mean()" to be a self-defined function? For example, i defined a function to find the log sum of the maximum and minimum value in each window. thanks – jester Jan 18 '19 at 03:13