- You can use shift() to start a new group whenever a row's start_time is greater than the end_time of the row above, i.e. the slot does not overlap the previous one. Rows that overlap or touch the previous slot stay in the same group.
- The first row has no row above, so shift() returns NaN there. We fillna with '24:00:00' because no start time within a day can be greater than 24 hours, so the comparison for the first row is a well-defined False and that row opens the first group.
- The comparison returns a boolean series of True and False (i.e. 1 and 0, respectively), so you just take the cumulative sum with cumsum(). The running total increases by one at every row that starts a new group.
- This creates a grp series, which we can include in groupby together with padel.
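# Sort so that each court's slots are in chronological order
df = df.sort_values(by=['padel', 'start_time'], ascending=[True, True])
# Group key: increases by one whenever a slot starts after the previous one ended
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
# Collapse each group to its first start_time and its last end_time
df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
# Length of each merged block in minutes
df['duration'] = ((pd.to_timedelta(df['end_time']) -
                   pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)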
Out[1]:
      padel start_time  end_time  duration
0  Padel 10   08:00:00  09:00:00        60
1  Padel 10   10:00:00  13:00:00       180
2  Padel 10   16:00:00  22:00:00       360
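To see how the group key is built, it can help to print the intermediate values. The following is only a minimal sketch on a hypothetical four-row sample (the names demo, prev_end and starts_new_block are made up for illustration and are not part of the answer's code). It relies on the times being zero-padded HH:MM:SS strings, so that plain string comparison orders them correctly.

```python
import pandas as pd

# Hypothetical sample: one isolated slot, two overlapping slots, then one more slot.
demo = pd.DataFrame({
    "start_time": ["08:00:00", "10:00:00", "10:30:00", "16:00:00"],
    "end_time":   ["09:00:00", "11:30:00", "12:00:00", "17:30:00"],
})

# End time of the row above; the first row has no predecessor, so fill it.
prev_end = demo["end_time"].shift().fillna("24:00:00")

# True where a slot starts after the previous one ended (a new block begins).
starts_new_block = demo["start_time"].gt(prev_end)

# Running count of True values gives one integer label per block.
grp = starts_new_block.cumsum()

print(pd.DataFrame({"prev_end": prev_end, "new_block": starts_new_block, "grp": grp}))
```

On this sample the labels come out as 0, 1, 1, 2: the second and third rows overlap, so they share a label and would be merged by the groupby.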
Full code with the input dataframe:
import pandas as pd

df = pd.DataFrame(pd.DataFrame({'padel': {38: 'Padel 10',
40: 'Padel 10',
42: 'Padel 10',
44: 'Padel 10',
46: 'Padel 10',
49: 'Padel 10',
51: 'Padel 10',
53: 'Padel 10',
55: 'Padel 10',
57: 'Padel 10',
59: 'Padel 10',
61: 'Padel 10',
63: 'Padel 10',
65: 'Padel 10',
67: 'Padel 10'},
'start_time': {38: '08:00:00',
40: '10:00:00',
42: '10:30:00',
44: '11:00:00',
46: '11:30:00',
49: '16:00:00',
51: '16:30:00',
53: '17:00:00',
55: '17:30:00',
57: '18:00:00',
59: '18:30:00',
61: '19:00:00',
63: '19:30:00',
65: '20:00:00',
67: '20:30:00'},
'end_time': {38: '09:00:00',
40: '11:30:00',
42: '12:00:00',
44: '12:30:00',
46: '13:00:00',
49: '17:30:00',
51: '18:00:00',
53: '18:30:00',
55: '19:00:00',
57: '19:30:00',
59: '20:00:00',
61: '20:30:00',
63: '21:00:00',
65: '21:30:00',
67: '22:00:00'},
'duration': {38: 60,
40: 90,
42: 90,
44: 90,
46: 90,
49: 90,
51: 90,
53: 90,
55: 90,
57: 90,
59: 90,
61: 90,
63: 90,
65: 90,
67: 90}}))
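# Sort first, as in the snippet above (the sample data is already in order)
df = df.sort_values(by=['padel', 'start_time'], ascending=[True, True])
# Group key: increases by one whenever a slot starts after the previous one ended
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
# Length of each merged block in minutes
df['duration'] = ((pd.to_timedelta(df['end_time']) -
                   pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)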
df
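A note on the duration step: .dt.seconds returns only the seconds component of each timedelta, which is fine here because every block lies within a single day. If a block could ever be 24 hours or longer, .dt.total_seconds() would be the safer accessor. A small sketch of that variant, using a hypothetical two-row frame named merged rather than the answer's data:

```python
import pandas as pd

# Hypothetical merged blocks, already collapsed to one row per block.
merged = pd.DataFrame({
    "start_time": ["08:00:00", "16:00:00"],
    "end_time":   ["09:00:00", "22:00:00"],
})

# Parse the HH:MM:SS strings as timedeltas and subtract.
delta = pd.to_timedelta(merged["end_time"]) - pd.to_timedelta(merged["start_time"])

# total_seconds() counts the whole span, including any full days.
merged["duration"] = (delta.dt.total_seconds() // 60).astype(int)

print(merged)
```

For same-day slots like these, both accessors give the same minutes.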