2

I have data with time intervals:

              Start               End
0  2022-01-01 00:00  2022-01-03 00:00
1  2022-02-01 00:00  2022-03-01 03:00
2  2022-04-01 11:00  2022-04-10 13:00
3  2022-08-01 12:00  2022-08-07 17:00

Having a list of time points, I select intervals that contain a certain Timestamp using boolean indexing:

result = []
for t in ['2022-01-02 00:00', '2022-04-01 11:00', '2022-08-01 12:00']:
    t = pd.Timestamp(t)
    df_sel = df[df['Start'].le(t) & df['End'].gt(t)]
    result.append(df_sel)

Resulting data frames have different lengths. What is the most efficient way to do boolean indexing with time data? Is it better to use NumPy or some other dtypes? How can I speed up my solution?

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
  • I've been working on this question that is pretty much exactly what you are asking. You can try my solution for yourself and see if it works for you. I'm not quite done with it but I think what I have is pretty fast already: https://stackoverflow.com/a/73216154/7059681 – ringo Aug 05 '22 at 21:54
  • The [piso](https://piso.readthedocs.io) package may be useful for you. Specifically the [`contains`](https://piso.readthedocs.io/en/latest/reference/api/piso.contains.html#piso.contains) or [`lookup`](https://piso.readthedocs.io/en/latest/reference/api/piso.lookup.html) functions. – Riley Aug 06 '22 at 04:13
  • Check out [`pandas.IntervalIndex`](https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.html). Combined with array-broadcasting and [`numpy.where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html), you can easily do this in a couple lines, though this might not scale with huge datasets. You can checkout my previous comment for a more scalable solution. – ringo Aug 07 '22 at 21:22
  • IntervalIndex doesn't work in this case, as my intervals have gaps between them. – Mykola Zotko Aug 08 '22 at 07:14
  • If you are still working on this, please create some minimal reproducible example. – dankal444 Aug 16 '22 at 07:24

1 Answers1

1

Is it better to use NumPy or some other dtypes?
What is the most efficient way to do boolean indexing with time data?

Pandas uses Numpy internally for most functions. Numpy is not optimal but pretty good for this.

The thing is the solution is inefficient because it iterate over the whole dataframe for each t value in the loop. The resulting complexity is O(m n) where m is len(df) and n is the number of timestamps. For n=3 values, this solution can hardly be optimized using Pandas/Numpy but it can be made a bit faster with Numba/Cython which are a bit more low-level. I assume you have more items in practice.

How can I speed up my solution?

This kind of problem can be solve using an interval tree. The idea is to add the start-end indices to the interval tree and then query the tree with each searched timestamp. Regarding the exact interval tree data structure used, you can store the indices so to find out all the rows. Note that the complexity is O(n log m + k) where k is the number of row found. If there is too many values in output, then there is no fast solution because Pandas indexing is in O(k) anyway. Note that such data structure is not provided by standard Python module but there are some external module that appear to do that like this one.

If n is huge, an alternative solution is to sort the timestamps, put them in a Numpy array, and do a binary search (np.searchsorted) so to know for each row the timestamps that are included. The binary search can be vectorized so to search many values at a time. The appending can hardly be vectorized though. This solution runs in O(m log n).

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59