7

Suppose I have two series of timestamps which are pairs of start/end times for various 5 hour ranges. They are not necessarily sequential, nor are they quantized to the hour.

import pandas as pd

start = pd.Series(pd.date_range('20190412',freq='H',periods=25))

# Drop a few indexes to make the series not sequential
start.drop([4,5,10,14]).reset_index(drop=True,inplace=True)

# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')

end = start + pd.Timedelta('5H')

Now suppose that we have some data that is timestamped by minute, over a range that encompasses all start/end pairs.

data_series = pd.Series(data=np.random.randint(20, size=(75*60)), 
                        index=pd.date_range('20190411',freq='T',periods=(75*60)))

We wish to obtain the values from the data_series within the range of each start and end time. This can be done naively inside a loop

frm = []
for s,e in zip(start,end):
    frm.append(data_series.loc[s:e].values)

As we can see this naive approach loops over each pair of start and end dates, gets the values from data.

However this implementation is slow if len(start) is large. Is there a way to perform this sort of logic leveraging pandas vector functions?

I feel it is almost like I want to to apply .loc with a vector or pd.Series rather than a single pd.Timestamp?

EDIT

Using .apply is no more/marginally more efficient than using the naive for loop. I was hoping to be pointed in direction of a pure vector solution

mch56
  • 742
  • 6
  • 24
  • In reality how large are the two DataFrames? – ALollz Apr 12 '19 at 13:12
  • `len(data_series)` ~ between 10^5 and 10^6 and `len(start)` say between 10^4 and 10^5. Loop performs the operation ~ 5s, however, this operation may need to performed 10^5 times for various`start` vectors making it an unfeasible approach – mch56 Apr 12 '19 at 13:16
  • Are all entries of `end` at fixed interval from corresponding entries in `start`? Like it is `5H` for the given sample. – Divakar Apr 12 '19 at 14:06
  • 1
    They are @Divakar – mch56 Apr 12 '19 at 15:04

3 Answers3

4

Basic Idea

As usual pandas would spend time on searching for that one specific index at data_series.loc[s:e], where s and e are datetime indices. That's costly when looping and that's exactly where we would improve. We would find all those indices in a vectorized manner with searchsorted. Then, we would extract the values off data_series as an array and use those indices obtained from searchsorted with simple integer-based indexing. Thus, there would be a loop with minimal work of simple-slicing off an array.

General mantra being - Do most work with pre-processing in a vectorized manner and minimal when looping.

The implementation would look something like this -

def select_slices_by_index(data_series, start, end):
    idx = data_series.index.values
    S = np.searchsorted(idx,start.values)
    E = np.searchsorted(idx,end.values)
    ar = data_series.values
    return [ar[i:j] for (i,j) in zip(S,E+1)]

Use NumPy-striding

For the specific case when the time-period between starts and ends are same for all entries and all slices are covered by that length, i.e. no out-of-bounds cases, we can use NumPy's sliding window trick.

We can leverage np.lib.stride_tricks.as_strided based scikit-image's view_as_windows to get sliding windows. More info on use of as_strided based view_as_windows.

from skimage.util.shape import view_as_windows

def select_slices_by_index_strided(data_series, start, end):
    idx = data_series.index.values
    L = np.searchsorted(idx,end.values[0])-np.searchsorted(idx,start.values[0])+1
    S = np.searchsorted(idx,start.values)
    ar = data_series.values
    w = view_as_windows(ar,L)
    return w[S]

Use this post if you don't have access to scikit-image.


Benchmarking

Let's scale up everything by 100x on the given sample data and test out.

Setup -

np.random.seed(0)
start = pd.Series(pd.date_range('20190412',freq='H',periods=2500))

# Drop a few indexes to make the series not sequential
start.drop([4,5,10,14]).reset_index(drop=True,inplace=True)

# Add some random minutes to the start as it's not necessarily quantized
start = start + pd.to_timedelta(np.random.randint(59,size=len(start)),unit='T')

end = start + pd.Timedelta('5H')
data_series = pd.Series(data=np.random.randint(20, size=(750*600)), 
                        index=pd.date_range('20190411',freq='T',periods=(750*600)))

Timings -

In [156]: %%timeit
     ...: frm = []
     ...: for s,e in zip(start,end):
     ...:     frm.append(data_series.loc[s:e].values)
1 loop, best of 3: 172 ms per loop

In [157]: %timeit select_slices_by_index(data_series, start, end)
1000 loops, best of 3: 1.23 ms per loop

In [158]: %timeit select_slices_by_index_strided(data_series, start, end)
1000 loops, best of 3: 994 µs per loop

In [161]: frm = []
     ...: for s,e in zip(start,end):
     ...:     frm.append(data_series.loc[s:e].values)

In [162]: np.allclose(select_slices_by_index(data_series, start, end),frm)
Out[162]: True

In [163]: np.allclose(select_slices_by_index_strided(data_series, start, end),frm)
Out[163]: True

140x+ and 170x speedups with these ones!

kmario23
  • 57,311
  • 13
  • 161
  • 150
Divakar
  • 218,885
  • 19
  • 262
  • 358
1

You can take advantage of the apply function if you move your series into a Dataframe:

pdf = pd.DataFrame({'s': start,'e':end})
pdf.apply(lambda x: data_series.loc[x['s']:x['e']].values, axis=1)

Dask can help you parrallelize this computation for big data amounts.

http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.apply https://github.com/dask/dask

thomas
  • 71
  • 6
  • `.apply` is just a loop under the hood and should be avoided for row based loops. It's still super slow when `len(pdf)` is large. I was hoping to be pointed in the direction of a proper vectorized solution. – mch56 Apr 12 '19 at 12:59
  • I'm not sure how this can be vectorized, however another direction worth looking at is IntervalIndex which is designed to handle this kind of indexing: https://stackoverflow.com/questions/46364710/finding-matching-intervals-in-pandas-intervalindex – thomas Apr 12 '19 at 13:23
1

You can find the indices where elements of start and end are in data_series using index.get_loc

ind_start = [data_series.index.get_loc(i) for i in start]
ind_end = [data_series.index.get_loc(i) for i in end]

Then using np.take_along_axis and np.r_ to peform the slice.

frm = [np.take_along_axis(data_series.values, np.r_[s,e],axis=0) for s,e in zip(ind_start,ind_end)]

using %timeit

%timeit [np.take_along_axis(data_series.values, np.r_[s,e],axis=0) for s,e in zip(ind_start,ind_end)]
425 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comparing to the for loop method using .loc

def timeme(start,end):
    frm = []
    for s,e in zip(start,end):
        frm.append(data_series.loc[s:e].values)

%timeit timeme(start,end)
2.99 ms ± 65.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dubbbdan
  • 2,650
  • 1
  • 25
  • 43