Check if row's date range overlap any previous rows date range in Python / Pandas Dataframe

Question

I have some data in a pandas dataframe that contains a rank column, a start date and an end date. The data is sorted on the rank column lowest to highest (consequently the start/end dates are unordered). I wish to remove every row whose date range overlaps ANY PREVIOUS rows'

By way of a toy example:

Raw Data

Rank    Start_Date  End_Date
1   1/1/2021    2/1/2021
2   1/15/2021   2/15/2021
3   12/7/2020   1/7/2021
4   5/1/2020    6/1/2020
5   7/10/2020   8/10/2020
6   4/20/2020   5/20/2020

Desired Result

Rank    Start_Date  End_Date
1   1/1/2021    2/1/2021
4   5/1/2020    6/1/2020
5   7/10/2020   8/10/2020

Explanation: Row 2 is removed because its start overlaps Row 1, Row 3 is removed because its end overlaps Row 1. Row 4 is retained as it doesn’t overlap any previously retained Rows (ie Row 1). Similarly, Row 5 is retained as it doesn’t overlap Row 1 or Row 4. Row 6 is removed because it overlaps with Row 4.

Attempts:

I can use np.where to check the previous row with the current row and create a column “overlap” and then subsequently filter this column. But this doesn’t satisfy my requirement (ie in the toy example above Row 3 would be included as it doesn’t overlap with Row2 but should be excluded as it overlaps with Row 1).


    df['overlap'] = np.where((df['start']> df['start'].shift(1)) &
                                  (df['start'] < df['end'].shift(1)),1 ,0)
    df['overlap'] = np.where((df['end'] < df['end'].shift(1)) &
                                  (df['end'] > df['start'].shift(1)), 1, df['overlap'])

I have tried an implementation based on answers from this question Removing 'overlapping' dates from pandas dataframe, using a lookback period from the End Date, but the length of days between my Start Date and End Date are not constant, and it doesnt seem to produce the correct answer anyway.

    target = df.iloc[0]
    day_diff = abs(target['End_Date'] - df['End_Date'])
    day_diff = day_diff.reset_index().sort_values(['End_Date', 'index'])
    day_diff.columns = ['old_index', 'End_Date']
    non_overlap = day_diff.groupby(day_diff['End_Date'].dt.days // window).first().old_index.values
    results = df.iloc[non_overlap]

What is the the date format? Month/day/year? – sammywemmy Aug 02 '21 at 21:35 — sammywemmy, Aug 02 '21 at 21:35

mozway · Accepted Answer · 2021-08-02T21:02:41.203

Two intervals overlap if (a) End2>Start1 and (b) Start2<End1: We can use numpy.triu to calculate those comparisons with the previous rows only:

a = np.triu(df['End_Date'].values>df['Start_Date'].values[:,None])
b = np.triu(df['Start_Date'].values<df['End_Date'].values[:,None])

The good rows are those that have only True on the diagonal for a&b

df[(a&b).sum(0)==1]

output:

Rank Start_Date   End_Date
   1 2021-01-01 2021-02-01
   4 2020-05-01 2020-06-01
   5 2020-07-10 2020-08-10

NB. as it needs to calculate the combination of rows, this method can use a lot of memory when the array becomes large, but it should be fast

I forgot to mention that I first converted the date columns to datetime type using `pandas.to_datetime` — mozway, Aug 02 '21 at 22:42

score 2 · Answer 2 · answered Aug 03 '21 at 04:30

Another option, that could help with memory usage, is a combination of IntervalIndex and a for loop:

Convert dates:

df.Start_Date = df.Start_Date.transform(pd.to_datetime, format='%m/%d/%Y')
df.End_Date = df.End_Date.transform(pd.to_datetime, format='%m/%d/%Y')

Create IntervalIndex:

intervals = pd.IntervalIndex.from_arrays(df.Start_Date, 
                                         df.End_Date, 
                                         closed='both')

Run a for loop (this avoids broadcasting, which while fast, can be memory intensive, depending on the array size):

index = np.arange(intervals.size)
keep = [] # indices of `df` to be retained    

# get rid of the indices where the intervals overlap
for interval in intervals:
    keep.append(index[0])
    checks = intervals[index].overlaps(intervals[index[0]])
    if checks.any():
        index = index[~checks]
    else:
        break
    if index.size == 0:
        break
df.loc[keep]
 
   Rank Start_Date   End_Date
0     1 2021-01-01 2021-02-01
3     4 2020-05-01 2020-06-01
4     5 2020-07-10 2020-08-10

Check if row's date range overlap any previous rows date range in Python / Pandas Dataframe

2 Answers2

Linked