
I currently have a list of tuples that looks like this:

time_constraints = [
    ('001', '01/01/2020 10:00 AM', '01/01/2020 11:00 AM'),
    ('001', '01/03/2020 05:00 AM', '01/03/2020 06:00 AM'),
    ...
    ('999', '01/07/2020 07:00 AM', '01/07/2020 08:00 AM')
]

where:

  • each tuple contains an id, lower_bound, and upper_bound
  • none of the time frames overlap for a given id
  • len(time_constraints) can be on the order of 10^4 to 10^5.

My goal is to quickly and efficiently filter a relatively large (millions of rows) Pandas dataframe (df) to include only the rows that match on the id column and fall between the specified lower_bound and upper_bound times (inclusive).
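
One caveat: the bounds above and df['timestamp'] are strings, and comparing 'MM/DD/YYYY hh:mm AM' strings lexicographically gives wrong results, so everything should be parsed to real datetimes first. A minimal sketch, assuming the formats shown above:

import pandas as pd

# Parse once up front; all range comparisons below assume
# real datetime64 values, not strings.
fmt = '%m/%d/%Y %I:%M %p'
df['timestamp'] = pd.to_datetime(df['timestamp'], format=fmt)
time_constraints = [(i, pd.to_datetime(lo, format=fmt), pd.to_datetime(up, format=fmt))
                    for i, lo, up in time_constraints]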

My current plan is to do this:

import pandas as pd

output = []
for i, lower, upper in time_constraints:
    # Collect the index labels of rows that match this id and fall
    # within [lower, upper] (inclusive).
    indices = df.loc[(df['id'] == i) &
                     (df['timestamp'] >= lower) &
                     (df['timestamp'] <= upper)].index
    output.extend(indices)

output_df = df.loc[df.index.isin(output)].copy()

However, using a for-loop isn't ideal. I was wondering if there was a better solution (ideally vectorized) using Pandas or NumPy arrays that would be faster.

Edit:

Here are some sample rows of df:

id  timestamp
 1  01/01/2020 9:56 AM
 1  01/01/2020 10:32 AM
 1  01/01/2020 10:36 AM
 2  01/01/2020 9:42 AM
 2  01/01/2020 9:57 AM
 2  01/01/2020 10:02 AM
ultimate09
  • Very interesting question! I asked something similar over [here](https://codereview.stackexchange.com/questions/264144/calculate-the-mean-of-a-nested-group-if-the-condition-holds) yesterday, but got no answers yet. – qualitaetsmuell Jul 18 '21 at 19:09
  • Why does df contain the same fields as time_constraints? I guess df has only one datetime field? Please update your post with a sample of df. – Corralien Jul 18 '21 at 20:49
  • @Corralien updated the post, it should make more sense now – ultimate09 Jul 18 '21 at 23:19
  • @qualitaetsmuell Don't hesitate to post your question on SO and leave a comment here to notify me. – Corralien Jul 19 '21 at 13:23
  • Hey @Corralien, thanks for your offer of help! Due to the great people on CR, I managed to reduce the runtime by a factor of 5. An updated version of the code can be found [here](https://colab.research.google.com/drive/1KmA4BdozEu09MWHelF-NBgqZjsFw6OHn?usp=sharing). Do you see any room for further improvement? If so, I'd gladly take your offer to help and post the question on SO. – qualitaetsmuell Jul 20 '21 at 14:15
  • @qualitaetsmuell, please post your question on SO. I have a one-line answer that matches your sample (2,000,000 records in less than one minute) :-) Don't forget to notify me here. – Corralien Jul 20 '21 at 20:45
  • Hey @Corralien, I have posted the question [here](https://stackoverflow.com/questions/68470617/calculate-the-mean-of-a-nested-group-if-the-condition-holds) :) – qualitaetsmuell Jul 21 '21 at 13:44

2 Answers


I already answered a similar case.

To test, I used 100,000 constraints (tc) and 5,000,000 records (df). Is this what you expect?

>>> df
          id           timestamp
0        565 2020-08-16 05:40:55
1        477 2020-04-05 22:21:40
2        299 2020-02-22 04:54:34
3        108 2020-08-17 23:54:02
4        041 2020-09-10 10:01:31
...      ...                 ...
4999995  892 2020-12-27 16:16:35
4999996  373 2020-08-29 05:44:34
4999997  659 2020-05-23 20:48:15
4999998  858 2020-09-08 22:58:20
4999999  710 2020-04-10 08:03:14

[5000000 rows x 2 columns]


>>> tc
        id         lower_bound         upper_bound
0      000 2020-01-01 00:00:00 2020-01-04 14:00:00
1      000 2020-01-04 15:00:00 2020-01-08 05:00:00
2      000 2020-01-08 06:00:00 2020-01-11 20:00:00
3      000 2020-01-11 21:00:00 2020-01-15 11:00:00
4      000 2020-01-15 12:00:00 2020-01-19 02:00:00
...    ...                 ...                 ...
99995  999 2020-12-10 09:00:00 2020-12-13 23:00:00
99996  999 2020-12-14 00:00:00 2020-12-17 14:00:00
99997  999 2020-12-17 15:00:00 2020-12-21 05:00:00
99998  999 2020-12-21 06:00:00 2020-12-24 20:00:00
99999  999 2020-12-24 21:00:00 2020-12-28 11:00:00

[100000 rows x 3 columns]
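
For reference, comparable test data can be generated along these lines (the ids, date range, and interval layout are assumptions made to match the samples above, not necessarily the exact setup used):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
start = pd.Timestamp('2020-01-01')

# 5,000,000 records: zero-padded ids '000'..'999', random 2020 timestamps.
n = 5_000_000
df = pd.DataFrame({
    'id': np.char.zfill(rng.integers(0, 1000, size=n).astype(str), 3),
    'timestamp': start + pd.to_timedelta(rng.integers(0, 366 * 24 * 3600, size=n), unit='s'),
})

# 100 constraints per id: 86-hour closed windows separated by 1-hour gaps.
k = np.arange(100)
lowers = start + pd.to_timedelta(87 * k, unit='h')
uppers = lowers + pd.Timedelta(hours=86)
time_constraints = [(f'{i:03d}', lo, up)
                    for i in range(1000)
                    for lo, up in zip(lowers, uppers)]

And the solution itself: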
# from tqdm import tqdm
from itertools import chain

# df = pd.DataFrame(data, columns=['id', 'timestamp'])
tc = pd.DataFrame(time_constraints, columns=['id', 'lower_bound', 'upper_bound'])

# Group both frames by id so each id's records and constraints
# can be looked up directly.
g1 = df.groupby('id')
g2 = tc.groupby('id')

indexes = []
# for id_ in tqdm(tc['id'].unique()):
for id_ in tc['id'].unique():
    df1 = g1.get_group(id_)   # records for this id
    df2 = g2.get_group(id_)   # constraints for this id

    # One closed interval per (lower_bound, upper_bound) pair; pd.cut maps
    # each timestamp to its containing interval (NaN if none), so dropna()
    # keeps only the index labels of in-range rows.
    ii = pd.IntervalIndex.from_tuples(list(zip(df2['lower_bound'],
                                               df2['upper_bound'])),
                                      closed='both')
    indexes.append(pd.cut(df1['timestamp'], bins=ii).dropna().index)

out = df.loc[chain.from_iterable(indexes)]
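
The key trick is that pd.cut accepts a non-overlapping IntervalIndex as bins: each value is mapped to the interval containing it, or NaN if it falls in none, so dropna() keeps exactly the index labels of the in-range rows. A toy illustration of the mechanism:

import pandas as pd

ii = pd.IntervalIndex.from_tuples([(1, 3), (5, 7)], closed='both')
s = pd.Series([0, 2, 4, 6, 8])

# 2 falls in [1, 3] and 6 falls in [5, 7]; everything else maps to NaN.
pd.cut(s, bins=ii).dropna().index   # -> index labels 1 and 3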

Performance (tqdm progress over the 1,000 unique ids):

100%|█████████████████████████████████████████████████| 1000/1000 [00:17<00:00, 58.40it/s]

Output result:

>>> out
          id           timestamp
1326     000 2020-11-10 05:51:00
1685     000 2020-10-07 03:12:48
2151     000 2020-05-08 11:11:18
2246     000 2020-07-06 07:36:57
3995     000 2020-02-02 04:39:11
...      ...                 ...
4996406  999 2020-02-19 15:27:06
4996684  999 2020-02-05 11:13:56
4997408  999 2020-07-09 09:31:31
4997896  999 2020-04-10 03:26:13
4999674  999 2020-04-21 22:57:04

[4942976 rows x 2 columns]  # 57024 records filtered
Corralien

You can use boolean indexing, like this:

output_df = df[pd.Series(list(zip(df['id'],
                                  df['lower_bound'],
                                  df['upper_bound']))).isin(time_constraints)]

The zip function creates a tuple from each row's columns, which is then compared against your list of tuples. pd.Series wraps the result so that .isin() yields a boolean mask for indexing.
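
For illustration, on a frame that actually has all three columns, the pattern looks like this (toy data):

import pandas as pd

df = pd.DataFrame({'id': ['001', '001', '002'],
                   'lower_bound': ['01/01/2020 10:00 AM'] * 3,
                   'upper_bound': ['01/01/2020 11:00 AM'] * 3})
constraints = [('001', '01/01/2020 10:00 AM', '01/01/2020 11:00 AM')]

# Each row becomes an (id, lower_bound, upper_bound) tuple, which is
# then tested for membership in the constraint list.
mask = pd.Series(list(zip(df['id'], df['lower_bound'],
                          df['upper_bound']))).isin(constraints)
df[mask]   # keeps the two '001' rows only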

Shivam Roy
  • Thanks for the answer. I updated the question slightly so I don't think the `.isin()` solution will work. I'm looking for cases where the `id` from `time_constraints` matches the `df` but also where the timestamp from the `df` falls between `lower_bound` and `upper_bound` (inclusive). – ultimate09 Jul 18 '21 at 19:23