I am working on a problem similar to the one here: I have a dataframe with two datetime columns, and I need to identify overlaps between consecutive rows of the same id.
import pandas as pd
from datetime import datetime
df = pd.DataFrame(columns=['id','from','to'], index=range(5),
data=[[878,'2006-01-01','2007-10-01'],
[878,'2007-10-02','2008-12-01'],
[878,'2008-12-02','2010-04-03'],
[879,'2010-04-04','2199-05-11'],
[879,'2016-05-12','2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])
The following works well for flagging the presence of an overlap as a boolean column:
df['overlap'] = (df.groupby('id')
.apply(lambda x: (x['to'].shift() - x['from']) > pd.Timedelta(seconds=0))
.reset_index(level=0, drop=True))
which returns (correctly):
id from to overlap
0 878 2006-01-01 2007-10-01 False
1 878 2007-10-02 2008-12-01 False
2 878 2008-12-02 2010-04-03 False
3 879 2010-04-04 2199-05-11 False
4 879 2016-05-12 2199-12-31 True
I would now like to extend this solution so that, whenever there is an overlap, it also keeps the start and the end of the overlap. I have tried having the apply return a pd.Series, as in
df.groupby('id').apply(lambda x:
pd.Series([x['to'].shift() - x['from'] > pd.Timedelta(seconds=0),
x['from'],
x['to'].shift()],
index=['is_overlap','start_overlap','end_overlap']))
But the resulting dataframe has a completely different shape (it no longer has 5 rows). What I want is:
id from to is_overlap start_overlap end_overlap
0 878 2006-01-01 2007-10-01 False NaT NaT
1 878 2007-10-02 2008-12-01 False NaT NaT
2 878 2008-12-02 2010-04-03 False NaT NaT
3 879 2010-04-04 2199-05-11 False NaT NaT
4 879 2016-05-12 2199-12-31 True 2016-05-12 2199-05-11
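For reference, here is a minimal sketch of one way the desired columns could be built without apply, so the frame keeps its 5 rows. It assumes the rows are already sorted by 'from' within each id and that each row is compared only with the row directly before it, as in my boolean version above. The name prev_to is just a helper I made up for illustration, and end_overlap is taken as the previous row's 'to', matching the table I wrote out.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['id', 'from', 'to'],
    data=[[878, '2006-01-01', '2007-10-01'],
          [878, '2007-10-02', '2008-12-01'],
          [878, '2008-12-02', '2010-04-03'],
          [879, '2010-04-04', '2199-05-11'],
          [879, '2016-05-12', '2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])

# Previous row's 'to' within each id (hypothetical helper name).
# The first row of every group gets NaT, and the index stays aligned with df.
prev_to = df.groupby('id')['to'].shift()

# Comparisons against NaT evaluate to False, so first rows are never flagged.
df['is_overlap'] = prev_to > df['from']

# Keep the boundaries only where an overlap was detected; other rows become NaT.
df['start_overlap'] = df['from'].where(df['is_overlap'])
df['end_overlap'] = prev_to.where(df['is_overlap'])

print(df)
```

Because shift on the grouped column returns a Series with the original index, nothing needs to be reset or re-joined afterwards. If the true end of the overlap is wanted instead (the earlier of the two 'to' values), the last line would need to take the row-wise minimum of prev_to and df['to'] before masking.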