I have a dataframe with two columns containing date ranges.
The original data looks something like this:
df1:
personid, visitto, intime, outtime
8, 8, 2017-07-01 06:00:00, 2017-07-01 08:00:00
17, 8, 2017-07-02 09:00:00, 2017-07-02 10:00:00
df2:
location, open, close
8, 2017-07-01 04:00:00, 2017-07-01 13:00:00
Here's what I have done so far:
1. Merge the two dataframes on visitto -> location:
merged_df:
personid, visitto, intime, outtime, location, open, close
8, 8, 2017-07-01 06:00:00, 2017-07-01 08:00:00, 8, 2017-07-01 04:00:00, 2017-07-01 13:00:00
17, 8, 2017-07-02 09:00:00, 2017-07-02 10:00:00, 8, 2017-07-01 04:00:00, 2017-07-01 13:00:00
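For reference, a minimal sketch of this merge step, rebuilding the sample frames from the question (same column names as above; only the two sample rows are used):

```python
import pandas as pd

# Rebuild the sample data from the question, parsing the dates as Timestamps.
df1 = pd.DataFrame({
    "personid": [8, 17],
    "visitto": [8, 8],
    "intime": pd.to_datetime(["2017-07-01 06:00:00", "2017-07-02 09:00:00"]),
    "outtime": pd.to_datetime(["2017-07-01 08:00:00", "2017-07-02 10:00:00"]),
})
df2 = pd.DataFrame({
    "location": [8],
    "open": pd.to_datetime(["2017-07-01 04:00:00"]),
    "close": pd.to_datetime(["2017-07-01 13:00:00"]),
})

# Step 1: inner-join visits to locations on visitto -> location.
merged_df = pd.merge(df1, df2, how="inner", left_on="visitto", right_on="location")
print(merged_df)
```

Both sample rows match location 8, so the merge yields two rows with seven columns, as shown above.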
2. Convert the four datetime columns to two pandas interval columns:
personid, visitto, visittime, opentime
8, 8, [2017-07-01 06:00:00, 2017-07-01 08:00:00], [2017-07-01 04:00:00, 2017-07-01 13:00:00]
17, 8, [2017-07-02 09:00:00, 2017-07-02 10:00:00], [2017-07-01 04:00:00, 2017-07-01 13:00:00]
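The interval-conversion step can be sketched with `pd.IntervalIndex.from_arrays`, which builds an interval column directly from a pair of start/end datetime columns (the merged frame is rebuilt inline here so the sketch is self-contained):

```python
import pandas as pd

# Merged frame from step 1, rebuilt from the question's sample rows.
merged_df = pd.DataFrame({
    "personid": [8, 17],
    "visitto": [8, 8],
    "intime": pd.to_datetime(["2017-07-01 06:00:00", "2017-07-02 09:00:00"]),
    "outtime": pd.to_datetime(["2017-07-01 08:00:00", "2017-07-02 10:00:00"]),
    "open": pd.to_datetime(["2017-07-01 04:00:00"] * 2),
    "close": pd.to_datetime(["2017-07-01 13:00:00"] * 2),
})

# Step 2: collapse each start/end pair into a single interval-dtype column.
merged_df["visittime"] = pd.IntervalIndex.from_arrays(
    merged_df["intime"], merged_df["outtime"], closed="both"
)
merged_df["opentime"] = pd.IntervalIndex.from_arrays(
    merged_df["open"], merged_df["close"], closed="both"
)
result = merged_df[["personid", "visitto", "visittime", "opentime"]]
print(result)
```

`closed="both"` is an assumption here — it treats the endpoints as inclusive, matching the bracketed `[start, end]` notation above.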
I now want to filter this merged dataframe of intervals to only return rows where the datetimes overlap. My expected result would be:
personid, visitto, visittime, opentime
8, 8, [2017-07-01 06:00:00, 2017-07-01 08:00:00], [2017-07-01 04:00:00, 2017-07-01 13:00:00]
I was able to get the desired outcome using iterrows() and comparing the overlaps row by row; however, this is extremely verbose. What I would prefer is something like (non-working example):
merged_df.loc[merged_df['visittime'].overlaps(merged_df['opentime'])]
Or, accomplish this during the merge of the dataframes. Something along the lines of (non-working example):
merge_df = pd.merge(df1[["personid", "visitto", "intime", "outtime"]], df2[["location", "open", "close"]], how='inner', left_on='visitto', right_on='location') #WHERE intime, outtime OVERLAPS open, close
Is it possible to perform this operation with a pandas function rather than looping? This is easy to do in SQL in the database, but I am struggling to find an efficient way to perform the same operation on the dataframe.
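One loop-free sketch (an assumption about the approach, not the only option): two closed intervals [a, b] and [c, d] overlap exactly when a <= d and c <= b, so the filter can be expressed as plain vectorized column comparisons on the merged frame, without building Interval objects at all. As far as I know, `IntervalArray.overlaps` only accepts a single scalar `Interval`, not an aligned column, which is why the `merged_df['visittime'].overlaps(...)` form above does not work row-wise.

```python
import pandas as pd

# Merged frame from step 1, rebuilt from the question's sample rows.
merged_df = pd.DataFrame({
    "personid": [8, 17],
    "visitto": [8, 8],
    "intime": pd.to_datetime(["2017-07-01 06:00:00", "2017-07-02 09:00:00"]),
    "outtime": pd.to_datetime(["2017-07-01 08:00:00", "2017-07-02 10:00:00"]),
    "open": pd.to_datetime(["2017-07-01 04:00:00"] * 2),
    "close": pd.to_datetime(["2017-07-01 13:00:00"] * 2),
})

# [intime, outtime] overlaps [open, close] iff intime <= close and open <= outtime.
mask = (merged_df["intime"] <= merged_df["close"]) & (
    merged_df["open"] <= merged_df["outtime"]
)
result = merged_df.loc[mask]
print(result)
```

On the sample data this keeps only the personid-8 row, since the second visit falls entirely after closing time. The same condition can also be applied immediately after `pd.merge`, which mirrors the SQL `WHERE ... OVERLAPS` pattern described above.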