For every customer_id I have several start dates and end dates.
When a customer has several overlapping date ranges I would like to reduce those to one line that has the minimum start date of the overlapping date ranges and the maximum end date of the overlapping date ranges.
Here's my example data frame:
customer_id start_date end_date
1 2019-01-01 2019-03-01
1 2020-01-02 2020-03-01
1 2020-01-03 2020-05-04
1 2020-01-05 2020-06-01
1 2020-01-07 2020-02-02
1 2020-09-03 2020-09-05
1 2020-09-04 2020-09-04
1 2020-10-01 NaT
2 2020-05-01 2020-05-03
This is what the end result should look like:
customer_id start_date end_date
1 2019-01-01 2019-03-01
1 2020-01-02 2020-06-01
1 2020-09-03 2020-09-05
1 2020-10-01 NaT
2 2020-05-01 2020-05-03
I've tried the following already, but that didn't really work out:
Find date range overlap in python
Here's sample code that generated these examples:
import pandas as pd
df = pd.DataFrame(data=[
[1, '2019-01-01', '2019-03-01'],
[1, '2020-01-03', '2020-05-04'],
[1, '2020-01-05', '2020-06-01'],
[1, '2020-01-02', '2020-03-01'],
[1, '2020-01-07', '2020-02-02'],
[1, '2020-09-03', '2020-09-05'],
[1, '2020-09-04', '2020-09-04'],
[1, '2020-10-01', None],
[2, '2020-05-01', '2020-05-03']],
columns=['customer_id', 'start_date', 'end_date'],
)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df.sort_values(by=['customer_id', 'start_date', 'end_date'])
expected_df = pd.DataFrame(data=[
[1, '2019-01-01', '2019-03-01'],
[1, '2020-01-02', '2020-06-01'],
[1, '2020-09-03', '2020-09-05'],
[1, '2020-10-01', None],
[2, '2020-05-01', '2020-05-03']],
columns=['customer_id', 'start_date', 'end_date'],
)
expected_df['start_date'] = pd.to_datetime(expected_df['start_date'])
expected_df['end_date'] = pd.to_datetime(expected_df['end_date'])
expected_df.sort_values(by=['customer_id', 'start_date', 'end_date'])