0

I'm trying to quickly filter a pandas dataframe containing a series of timestamps, such that the times fall with a set of ranges (in a different dataframe).

Currently I use a filter function and use apply, which works but is quite slow. Am I missing an obvious solution?

import pandas as pd

# Dataframe of timestamps
df = pd.DataFrame(pd.date_range('20180117', end='20180118', freq='60s'),
                  columns=['time'])

# Dataframe of intervals
df2= pd.DataFrame([[pd.Timestamp('201801170005'), pd.Timestamp('201801170020')],
                   [pd.Timestamp('201801171415'), pd.Timestamp('201801171430')],
                   [pd.Timestamp('201801171800'), pd.Timestamp('201801171900')]],
                  columns=['start','end'])

def flag_during(timestamp, df):
    """Test if a timestamp falls between any [start, end] pairs"""
    return any((df['start']<timestamp) & (timestamp<df['end']))

# Add a column indicating if the time falls within any interval
%timeit df['During'] = df['time'].apply(lambda l: flag_during(l, df2))

Timeit returns

682 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result

I implemented the solution from How to join two dataframes for which column values are within a certain range? (or the accompanying answer) and came up with the following:

%%timeit
idx = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
event =  ~pd.isna(df2['start'].reindex(idx.get_indexer(df['time'])))
df['During2'] = event.values

Returns

826 µs ± 7.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

About 1000 times faster.

Steve
  • 1,579
  • 10
  • 23

0 Answers0