I have a pandas DataFrame with many rows. Each row contains an object and the machining interval on a certain machine (a start time and an end time). Each object can be processed on several machines in succession, and the intervals can overlap or leave gaps. I need to find the actual duration of each job, i.e. the total time during which the object is being machined on at least one machine. For example:
Object | Machine | T start | T end |
---|---|---|---|
1 | A | 17:26 | 17:57 |
1 | B | 17:26 | 18:33 |
1 | C | 18:56 | 19:46 |
2 | A | 14:00 | 15:00 |
2 | C | 14:30 | 15:00 |
3 | A | 12:00 | 12:30 |
3 | C | 13:00 | 13:45 |
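For reference, the sample data can be reconstructed roughly like this (a minimal sketch; I assume the start/end columns are parsed as datetimes, all on the same day):

```python
import pandas as pd

# Minimal reconstruction of the sample data above (all times assumed on the same day)
df = pd.DataFrame({
    'Object':  [1, 1, 1, 2, 2, 3, 3],
    'Machine': ['A', 'B', 'C', 'A', 'C', 'A', 'C'],
    'T start': pd.to_datetime(['17:26', '17:26', '18:56', '14:00', '14:30', '12:00', '13:00']),
    'T end':   pd.to_datetime(['17:57', '18:33', '19:46', '15:00', '15:00', '12:30', '13:45']),
})
```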
For object 1 the actual duration is 117 minutes (machines A and B overlap and together cover 17:26 to 18:33, i.e. 67 minutes, plus 50 minutes on machine C from 18:56 to 19:46), for object 2 it is 60 minutes and for object 3 it is 75 minutes. I tried a groupby in which I calculated, for each object, the sum of the durations of the individual processes and the minimum and maximum timestamps, i.e. the first start and the last end. Then I wrote a function that compares these values, but while it works for objects 2 and 3 it fails for object 1: the end-to-end span (140 minutes) still contains the idle gap between 18:33 and 18:56, and the plain sum (148 minutes) double-counts the overlap between machines A and B. Here is my solution:
Object | min | max | sumT (minutes) | LT_ACTUAL (minutes) |
---|---|---|---|---|
1 | 17:26 | 19:46 | 148 | 140 ERROR! |
2 | 14:00 | 15:00 | 90 | 60 OK! |
3 | 12:00 | 13:45 | 75 | 75 OK! |
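The min, max and sumT columns were produced with an aggregation along these lines (a rough sketch rather than the exact code; duration is a helper column holding each row's machining time in minutes):

```python
# Machining time of each row in minutes
df['duration'] = (df['T end'] - df['T start']) / pd.Timedelta(minutes=1)

# First start, last end and summed duration per object
df = df.groupby('Object').agg(
    min=('T start', 'min'),
    max=('T end', 'max'),
    sumT=('duration', 'sum'),
).reset_index()
```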
```python
def calc_lead_time(min_t_start, max_t_end, t_sum):
    # Elapsed time from the first start to the last end, in minutes
    t_max_min = (max_t_end - min_t_start) / pd.Timedelta(minutes=1)
    # Take the smaller of the end-to-end span and the summed durations
    if t_max_min <= t_sum:
        return t_max_min
    else:
        return t_sum

df['LT_ACTUAL'] = df.apply(lambda x: calc_lead_time(x['min'], x['max'], x['sumT']), axis=1)
```