Python: how to merge overlapping time spans into a bigger one

Question

I have the following dataframe.

       padel start_time  end_time  duration
38  Padel 10   08:00:00  09:00:00        60
40  Padel 10   10:00:00  11:30:00        90
42  Padel 10   10:30:00  12:00:00        90
44  Padel 10   11:00:00  12:30:00        90
46  Padel 10   11:30:00  13:00:00        90
49  Padel 10   16:00:00  17:30:00        90
51  Padel 10   16:30:00  18:00:00        90
53  Padel 10   17:00:00  18:30:00        90
55  Padel 10   17:30:00  19:00:00        90
57  Padel 10   18:00:00  19:30:00        90
59  Padel 10   18:30:00  20:00:00        90
61  Padel 10   19:00:00  20:30:00        90
63  Padel 10   19:30:00  21:00:00        90
65  Padel 10   20:00:00  21:30:00        90
67  Padel 10   20:30:00  22:00:00        90

I want to choose the longest time spans, merging the ones that overlap. The output I want should look like this:

       padel start_time  end_time  duration
38  Padel 10   08:00:00  09:00:00        60
40  Padel 10   10:00:00  13:00:00       180
49  Padel 10   16:00:00  22:00:00       360

I don't care about duration; I can compute that myself. But how do I merge the time spans that overlap? Thanks.

good question. If so, add padel to the sort (first), and add `and row['padel'] == next_row['padel']` to the `elif` condition. — Paul Fornia, Jan 07 '21 at 01:19

score 1 · Accepted Answer · edited Jan 07 '21 at 02:57

You can use shift() to compare each row's start_time with the end_time of the row above. A new group starts wherever start_time is greater than the previous end_time, i.e. wherever the row does not overlap the one before it.
shift() leaves NaN in the first row, so we fillna with '24:00:00' to give that row a valid string to compare against. No start_time can be later than 24:00:00, so the first row evaluates to False and falls into group 0.
The comparison returns a boolean series of True and False (i.e. 1 and 0, respectively), so taking the cumulative sum with cumsum gives every row a group number that increases by one at each gap.
This creates the grp series, which we can pass to groupby together with padel.

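import pandas as pd
# Sort so that overlapping spans of the same padel sit on consecutive rows
df = df.sort_values(by=['padel', 'start_time'], ascending=[True, True])
# Flag rows that start after the previous row ended, then number the groups
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
# Recompute the duration in minutes
df['duration'] = ((pd.to_timedelta(df['end_time']) -
                   pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)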
Out[1]: 
      padel start_time  end_time  duration
0  Padel 10   08:00:00  09:00:00        60
1  Padel 10   10:00:00  13:00:00       180
2  Padel 10   16:00:00  22:00:00       360
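
To see what the shift/gt/cumsum chain produces, here is a minimal sketch on a made-up four-row frame (the `demo` frame and the intermediate variable names are illustrative, not part of the answer). It prints the previous end_time, the "new group" flag, and the resulting group number side by side.

```python
import pandas as pd

# Hypothetical frame: one slot, two overlapping slots, then a slot after a gap
demo = pd.DataFrame({
    'start_time': ['08:00:00', '10:00:00', '10:30:00', '16:00:00'],
    'end_time':   ['09:00:00', '11:30:00', '12:00:00', '17:30:00'],
})

# end_time of the row above; the first row has none, so it is filled
prev_end = demo['end_time'].shift().fillna('24:00:00')
# True where the row starts after the previous row ended (a gap)
starts_new_group = demo['start_time'].gt(prev_end)
# Running count of gaps, used as the group number
grp = starts_new_group.cumsum()

print(pd.DataFrame({'prev_end': prev_end,
                    'new_group': starts_new_group,
                    'grp': grp}))
```

The first row is compared against '24:00:00' and comes out False, so the groups are numbered from 0. The comparison works on strings only because the times are zero-padded HH:MM:SS.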

Full Code with input dataframe

import pandas as pd

df = pd.DataFrame(pd.DataFrame({'padel': {38: 'Padel 10',
  40: 'Padel 10',
  42: 'Padel 10',
  44: 'Padel 10',
  46: 'Padel 10',
  49: 'Padel 10',
  51: 'Padel 10',
  53: 'Padel 10',
  55: 'Padel 10',
  57: 'Padel 10',
  59: 'Padel 10',
  61: 'Padel 10',
  63: 'Padel 10',
  65: 'Padel 10',
  67: 'Padel 10'},
 'start_time': {38: '08:00:00',
  40: '10:00:00',
  42: '10:30:00',
  44: '11:00:00',
  46: '11:30:00',
  49: '16:00:00',
  51: '16:30:00',
  53: '17:00:00',
  55: '17:30:00',
  57: '18:00:00',
  59: '18:30:00',
  61: '19:00:00',
  63: '19:30:00',
  65: '20:00:00',
  67: '20:30:00'},
 'end_time': {38: '09:00:00',
  40: '11:30:00',
  42: '12:00:00',
  44: '12:30:00',
  46: '13:00:00',
  49: '17:30:00',
  51: '18:00:00',
  53: '18:30:00',
  55: '19:00:00',
  57: '19:30:00',
  59: '20:00:00',
  61: '20:30:00',
  63: '21:00:00',
  65: '21:30:00',
  67: '22:00:00'},
 'duration': {38: 60,
  40: 90,
  42: 90,
  44: 90,
  46: 90,
  49: 90,
  51: 90,
  53: 90,
  55: 90,
  57: 90,
  59: 90,
  61: 90,
  63: 90,
  65: 90,
  67: 90}}))
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
df['duration'] = ((pd.to_timedelta(df['end_time']) -
                   pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)
df
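
If the real data holds several padels, or a slot that lies entirely inside an earlier one, comparing each row only with the row directly above may not be enough. The following is a sketch of one possible variant, not the answer's method: it compares each start with the latest end seen so far on the same padel. The helper name `merge_padel_spans` and the sample frame are assumptions made for illustration.

```python
import pandas as pd

def merge_padel_spans(df):
    """Hypothetical helper: merge overlapping spans separately for each padel."""
    df = df.sort_values(['padel', 'start_time'])
    start_s = pd.to_timedelta(df['start_time']).dt.total_seconds()
    end_s = pd.to_timedelta(df['end_time']).dt.total_seconds()
    # Latest end seen so far on the same padel, not counting the current row
    prev_end_s = end_s.groupby(df['padel']).transform(lambda s: s.cummax().shift())
    # A span starts on the first row of a padel (NaN) or after a gap
    new_span = prev_end_s.isna() | (start_s > prev_end_s)
    span_id = new_span.cumsum().rename('span_id')
    out = (df.groupby(span_id)
             .agg(padel=('padel', 'first'),
                  start_time=('start_time', 'first'),
                  end_time=('end_time', 'max'))
             .reset_index(drop=True))
    out['duration'] = ((pd.to_timedelta(out['end_time'])
                        - pd.to_timedelta(out['start_time'])).dt.total_seconds() // 60).astype(int)
    return out

# Made-up data: Padel 1 has a slot contained in another one; Padel 2 overlaps Padel 1 in time
sample = pd.DataFrame({
    'padel': ['Padel 1', 'Padel 1', 'Padel 1', 'Padel 1', 'Padel 2'],
    'start_time': ['08:00:00', '09:00:00', '11:00:00', '15:00:00', '08:30:00'],
    'end_time':   ['12:00:00', '10:00:00', '13:00:00', '16:00:00', '09:30:00'],
})
result = merge_padel_spans(sample)
print(result)
```

Here the three morning slots on Padel 1 merge into 08:00:00 to 13:00:00, while the Padel 2 slot stays separate even though its times overlap with Padel 1.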

That's not giving the correct output: ``` Out[61]: padel start_time end_time duration 0 Padel 10 09:00:00 14:00:00 300 1 Padel 10 15:00:00 10:00:00 1140 ``` — gulbaz khan, Jan 07 '21 at 02:04
It's giving me 9 to 14, but it should give 8 to 9. I also tried changing the columns to datetime, and then it gave me `TypeError: dtype datetime64[ns] cannot be converted to timedelta64[ns]` — gulbaz khan, Jan 07 '21 at 02:12
@gulbazkhan I would run the full code with input dataframe that I have included in my answer and identify what might be different with your actual data. This works 100% correctly on the sample data in your question. — David Erickson, Jan 07 '21 at 02:17
Aaah that makes sense @gulbazkhan if you wouldn't mind upvoting my solution as well, then I would greatly appreciate it! I have just upvoted your question as it was a good question. Also, if other solutions were helpful, then you can upvote those as well. — David Erickson, Jan 07 '21 at 02:30

score 1 · Answer 2 · answered Jan 07 '21 at 01:47

import pandas as pd

# Coerce the start and end times to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

# A new set starts wherever end_time minus the previous start_time is not 2 hours
key = df.end_time.sub(df.start_time.shift(1)).ne('2h').cumsum()

# Find the last entry in each set of padel slots
g = df.groupby(key).tail(1).reset_index()

# Set start_time to the first start_time in each set
g = g.assign(start_time=df.groupby(key).start_time.head(1).reset_index().loc[:, 'start_time'])

# Calculate the duration in minutes
minutes = df.groupby(key).apply(lambda x: (x['end_time'].max() - x['start_time'].min()).total_seconds() / 60)
g = g.iloc[:, :-1].join(minutes.to_frame('duration').reset_index(drop=True))



      padel start_time  end_time  duration
0  Padel 10   08:00:00  09:00:00        60
1  Padel 10   10:00:00  13:00:00       180
2  Padel 10   16:00:00  22:00:00       360
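
The `ne('2h')` test in this answer depends on the shape of the sample data: slots are 90 minutes long and start 30 minutes apart, so within a run the end_time is always exactly 2 hours after the previous row's start_time. A small sketch of that check, using timedeltas instead of datetimes and rows 40, 42, 44 and 49 of the question's data (the variable names are illustrative):

```python
import pandas as pd

# Three consecutive slots, then the first slot after the gap
start = pd.to_timedelta(pd.Series(['10:00:00', '10:30:00', '11:00:00', '16:00:00']))
end = pd.to_timedelta(pd.Series(['11:30:00', '12:00:00', '12:30:00', '17:30:00']))

# end_time minus the previous row's start_time
diff = end - start.shift(1)
# Anything other than exactly 2 hours (including the NaT in the first row) starts a new set
is_new_set = diff != pd.Timedelta('2h')

print(diff.tolist())
print(is_new_set.cumsum().tolist())
```

With slots of a different length or spacing the difference would no longer be 2 hours, so the constant would have to change or be replaced by a direct overlap test.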

The problem is solved, thanks. By the way, I get this when running it: `ValueError: columns overlap but no suffix specified: Index(['duration'], dtype='object')` — gulbaz khan, Jan 07 '21 at 02:30
Works for me perfectly. What specific line in the code gives you that error? — wwnde, Jan 07 '21 at 02:34
It occurs on the last line. By the way, it solved my problem; duration wasn't a big deal. Thanks — gulbaz khan, Jan 07 '21 at 02:37

score 0 · Answer 3 · edited Jan 07 '21 at 01:05

I can't think of an easy pandas way to do it, so I'd just go with a for loop. Haven't tested this code, but something like:

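import pandas as pd

# Sort so that overlapping spans of the same padel are adjacent
df = df.sort_values(['padel', 'start_time'])
merged = []     # finished spans
current = None  # span currently being extended

for _, row in df.iterrows():
    if current is None:
        current = row.copy()
    elif row['padel'] == current['padel'] and row['start_time'] <= current['end_time']:
        # Overlap: extend the current span
        current['end_time'] = max(current['end_time'], row['end_time'])
    else:
        # Gap: store the finished span and start a new one from this row
        merged.append(current)
        current = row.copy()

if current is not None:
    merged.append(current)
out_df = pd.DataFrame(merged)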
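
The merging rule behind the loop can also be tried without pandas. This is a minimal sketch on plain (start, end) tuples; `merge_spans` is a hypothetical helper name, and it assumes all spans belong to the same padel and use zero-padded HH:MM:SS strings so that string comparison matches time order.

```python
def merge_spans(spans):
    """Hypothetical helper: merge overlapping (start, end) pairs."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged span: extend its end if needed
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Gap: this row opens a new span and is kept as its first member
            merged.append((start, end))
    return merged

spans = [('10:30:00', '12:00:00'), ('08:00:00', '09:00:00'), ('10:00:00', '11:30:00')]
print(merge_spans(spans))
```

The point to watch is the gap branch: the row that opens a new span has to be kept as that span's start, otherwise the merged span begins one row too late.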

answered Jan 07 '21 at 01:00 by Paul Fornia · edited Jan 07 '21 at 01:05 by Dharman

I should get 10:00, but it gives 10:30; the same happens at 16:30. Everything else is fine. – gulbaz khan Jan 07 '21 at 02:36
