I illustrate my question with the following example. I have two pandas DataFrames. The first has ten-second timesteps and is continuous. Example data for two days:
import pandas as pd
import random

t_10s = pd.date_range(start='1/1/2018', end='1/3/2018', freq='10s')
t_10s = pd.DataFrame(columns=['b'],
                     data=[random.randint(0, 10) for _ in range(len(t_10s))],
                     index=t_10s)
The next DataFrame has five-minute timesteps, but there is only data during daytime, and the logging starts at a different time each morning. Example data for two days, starting at two different times in the morning to resemble the real data:
t_5m1 = pd.date_range(start='1/1/2018 08:08:30', end='1/1/2018 18:03:30', freq='5min')
t_5m2 = pd.date_range(start='1/2/2018 08:10:25', end='1/2/2018 18:00:25', freq='5min')
t_5m = t_5m1.append(t_5m2)
t_5m = pd.DataFrame(columns=['a'],
                    data=[0 for _ in range(len(t_5m))],
                    index=t_5m)
What I want to do, for each datapoint x in t_5m, is to find the average of the t_10s data over a five-minute window centered on x. I have found a way to do this with a list comprehension as follows:
tstep = pd.to_timedelta(2.5, 'min')
t_5m['avg'] = [t_10s.loc[(t_10s.index >= t_5m.index[i] - tstep) &
                         (t_10s.index < t_5m.index[i] + tstep)].b.mean()
               for i in range(len(t_5m))]
However, I want to do this for a time series spanning at least two years and for many columns, not just b as here; my current solution is to for-loop over the relevant columns (roughly as sketched below), and the code then gets very slow. Can anyone think of a trick to do this more efficiently? I have thought about using resample or groupby, which would work if I had a regular five-minute interval, but since the interval is irregular between days, I cannot make it work. Grateful for any input!
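To be concrete, the per-column version is roughly this (with the toy data there is only the single column b, but the real data has many such columns):

for col in t_10s.columns:
    t_5m['avg_' + col] = [t_10s.loc[(t_10s.index >= t_5m.index[i] - tstep) &
                                    (t_10s.index < t_5m.index[i] + tstep), col].mean()
                          for i in range(len(t_5m))]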
I have looked around some, e.g. here, but couldn't find what I need.
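For reference, here is one vectorized direction I have been toying with: numpy searchsorted lookups over per-column cumulative sums. This is only a sketch that I have tried on the toy data above; the bounds are meant to match the list comprehension (>= on the left edge, < on the right), and it produces the means for all columns at once (an empty window would come out as NaN, with a divide warning):

import numpy as np

# Prepend a row of zeros so that csum[i] holds the per-column sum of
# the first i samples on the regular 10 s grid.
vals = t_10s.to_numpy(dtype=float)
csum = np.vstack([np.zeros((1, vals.shape[1])), vals.cumsum(axis=0)])

# For every 5-minute stamp x, find the position of the first sample
# >= x - tstep and of the first sample >= x + tstep (t_10s.index is
# sorted, which pd.date_range guarantees here).
left = t_10s.index.searchsorted(t_5m.index - tstep, side='left')
right = t_10s.index.searchsorted(t_5m.index + tstep, side='left')

# Mean over [x - tstep, x + tstep) = windowed sum / windowed count.
means = (csum[right] - csum[left]) / (right - left)[:, None]
t_5m['avg_vec'] = means[:, 0]  # columns come out in t_10s.columns order

As far as I can tell this gives the same numbers as the loop, but I am not sure it is the idiomatic pandas way, so better suggestions are very welcome.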