I have some data in a pandas DataFrame that contains a rank column, a start date, and an end date. The data is sorted on the rank column from lowest to highest (consequently, the start/end dates are unordered). I wish to remove every row whose date range overlaps that of ANY PREVIOUS row.
By way of a toy example:
Raw Data:
Rank  Start_Date  End_Date
1     1/1/2021    2/1/2021
2     1/15/2021   2/15/2021
3     12/7/2020   1/7/2021
4     5/1/2020    6/1/2020
5     7/10/2020   8/10/2020
6     4/20/2020   5/20/2020
Desired Result:
Rank  Start_Date  End_Date
1     1/1/2021    2/1/2021
4     5/1/2020    6/1/2020
5     7/10/2020   8/10/2020
Explanation: Row 2 is removed because its start overlaps Row 1, and Row 3 is removed because its end overlaps Row 1. Row 4 is retained as it doesn’t overlap any previously retained rows (i.e. Row 1). Similarly, Row 5 is retained as it doesn’t overlap Row 1 or Row 4. Row 6 is removed because it overlaps Row 4.
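To make the rule in the explanation concrete, here is a minimal sketch of it as a plain loop over the rows in rank order. It is only an illustration, not a tuned solution: `drop_overlapping` is a hypothetical helper name, the dates are assumed to be month/day/year, and ranges are assumed to be closed (two ranges that merely share an endpoint count as overlapping).

```python
import pandas as pd

# Toy data from the example above; dates are assumed to be month/day/year.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "Start_Date": pd.to_datetime(
        ["1/1/2021", "1/15/2021", "12/7/2020", "5/1/2020", "7/10/2020", "4/20/2020"],
        format="%m/%d/%Y",
    ),
    "End_Date": pd.to_datetime(
        ["2/1/2021", "2/15/2021", "1/7/2021", "6/1/2020", "8/10/2020", "5/20/2020"],
        format="%m/%d/%Y",
    ),
})


def drop_overlapping(frame):
    """Hypothetical helper: keep a row only if it overlaps no previously kept row."""
    kept_ranges = []  # (start, end) pairs of the rows retained so far
    kept_index = []   # index labels of the rows retained so far
    for idx, start, end in zip(frame.index, frame["Start_Date"], frame["End_Date"]):
        # Assumption: closed intervals, so two ranges overlap when each one
        # starts on or before the other one ends.
        overlaps_kept = any(
            start <= k_end and k_start <= end for k_start, k_end in kept_ranges
        )
        if not overlaps_kept:
            kept_ranges.append((start, end))
            kept_index.append(idx)
    return frame.loc[kept_index]


result = drop_overlapping(df)
print(result["Rank"].tolist())  # [1, 4, 5]
```

On the toy data this retains Ranks 1, 4 and 5, matching the desired result. The loop compares each row against every retained row, so it is quadratic in the worst case, which may or may not matter at the real data size.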
Attempts:
- I can use np.where to compare each row with the row before it, create an “overlap” column, and then filter on that column. But this doesn’t satisfy my requirement (i.e. in the toy example above, Row 3 would be included as it doesn’t overlap Row 2, but it should be excluded as it overlaps Row 1).
import numpy as np
# Flag rows whose start falls inside the previous row's date range
df['overlap'] = np.where((df['Start_Date'] > df['Start_Date'].shift(1)) &
                         (df['Start_Date'] < df['End_Date'].shift(1)), 1, 0)
# Also flag rows whose end falls inside the previous row's date range
df['overlap'] = np.where((df['End_Date'] < df['End_Date'].shift(1)) &
                         (df['End_Date'] > df['Start_Date'].shift(1)), 1, df['overlap'])
- I have tried an implementation based on answers to the question “Removing 'overlapping' dates from pandas dataframe”, using a lookback period from the End Date, but the number of days between my Start Date and End Date is not constant, and it doesn’t seem to produce the correct answer anyway.
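# window: the lookback period in days (its value is not shown here)
target = df.iloc[0]
# Absolute gap between each row's End_Date and the top-ranked row's End_Date
day_diff = abs(target['End_Date'] - df['End_Date'])
day_diff = day_diff.reset_index().sort_values(['End_Date', 'index'])
day_diff.columns = ['old_index', 'End_Date']
# Keep the first row from each window-sized bucket of day gaps
non_overlap = day_diff.groupby(day_diff['End_Date'].dt.days // window).first().old_index.values
results = df.iloc[non_overlap]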
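For comparison with the first attempt, the sketch below checks each row against every earlier row rather than only the one immediately before it, using NumPy broadcasting. It rests on several assumptions: dates are month/day/year, ranges are closed, and "previous" means any earlier row, including rows that are themselves removed. That last point is not quite the same as checking only previously retained rows, although both readings give the same answer on the toy data.

```python
import numpy as np
import pandas as pd

# Toy data from the example above; dates are assumed to be month/day/year.
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "Start_Date": pd.to_datetime(
        ["1/1/2021", "1/15/2021", "12/7/2020", "5/1/2020", "7/10/2020", "4/20/2020"],
        format="%m/%d/%Y",
    ),
    "End_Date": pd.to_datetime(
        ["2/1/2021", "2/15/2021", "1/7/2021", "6/1/2020", "8/10/2020", "5/20/2020"],
        format="%m/%d/%Y",
    ),
})

start = df["Start_Date"].to_numpy()
end = df["End_Date"].to_numpy()

# pairwise[i, j] is True when row i and row j overlap (closed intervals assumed).
pairwise = (start[:, None] <= end[None, :]) & (start[None, :] <= end[:, None])

# Keep only the strictly lower triangle, i.e. pairs where row j comes before row i.
with_earlier = np.tril(pairwise, k=-1)

# A row is flagged if it overlaps at least one earlier row.
df["overlap"] = with_earlier.any(axis=1)
result = df[~df["overlap"]]
print(result["Rank"].tolist())  # [1, 4, 5]
```

This builds an n-by-n boolean matrix, so memory grows quadratically with the number of rows; it is meant to show the comparison, not to scale to very large frames.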