Suppose I have two series of timestamps which are pairs of start/end times for various 5 hour ranges. They are not necessarily sequential, nor are they quantized to the hour.
import pandas as pd
start = pd.Series(pd.date_range('20190412',freq='H',periods=25))
# Drop a few indexes to make the series not sequential
start.drop([4,5,10,14]).reset_index(drop=True,inplace=True)
# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')
end = start + pd.Timedelta('5H')
Now suppose that we have some data that is timestamped by minute, over a range that encompasses all start/end pairs.
data_series = pd.Series(data=np.random.randint(20, size=(75*60)),
index=pd.date_range('20190411',freq='T',periods=(75*60)))
We wish to obtain the values from the data_series
within the range of each start
and end
time. This can be done naively inside a loop
frm = []
for s,e in zip(start,end):
frm.append(data_series.loc[s:e].values)
As we can see this naive approach loops over each pair of start
and end
dates, gets the values from data.
However this implementation is slow if len(start)
is large. Is there a way to perform this sort of logic leveraging pandas
vector functions?
I feel it is almost like I want to to apply .loc
with a vector or pd.Series
rather than a single pd.Timestamp
?
EDIT
Using .apply
is no more/marginally more efficient than using the naive for
loop. I was hoping to be pointed in direction of a pure vector solution