I currently have a list of tuples that look like this:
time_constraints = [
('001', '01/01/2020 10:00 AM', '01/01/2020 11:00 AM'),
('001', '01/03/2020 05:00 AM', '01/03/2020 06:00 AM'),
...
('999', '01/07/2020 07:00 AM', '01/07/2020 08:00 AM')
]
where:
- each tuple contains an
id
,lower_bound
, andupper_bound
- none of the time frames overlap for a given
id
len(time_constraints)
can be on the order of 10^4 to 10^5.
My goal is to quickly and efficiently filter a relatively large (millions of rows) Pandas dataframe (df
) to include only the rows that match on the id
column and fall between the specified lower_bound
and upper_bound
times (inclusive).
My current plan is to do this:
import pandas as pd
output = []
for i, lower, upper in time_constraints:
indices = list(df.loc[(df['id'] == i) & (df['timestamp'] >= lower) & (df['timestamp'] <= upper), ].index)
output.extend(indices)
output_df = df.loc[df.index.isin(output), ].copy()
However, using a for-loop isn't ideal. I was wondering if there was a better solution (ideally vectorized) using Pandas or NumPy arrays that would be faster.
Edited:
Here's some sample rows of df
:
id | timestamp |
---|---|
1 | 01/01/2020 9:56 AM |
1 | 01/01/2020 10:32 AM |
1 | 01/01/2020 10:36 AM |
2 | 01/01/2020 9:42 AM |
2 | 01/01/2020 9:57 AM |
2 | 01/01/2020 10:02 AM |