0

I am working on a project which requires me to identify, count and determine the duration of the concurrent events with the same type between any two of the subjects (e.g. machines). The data are organized in a time series (datetime type) with unequal time intervals (i.e. irregular time series). Below is the sample data. For each same machine, the events are continuous in time with no gaps.

MachineID,Event_StartTime,Event_EndTime,Event_Type,Duration
A,11/10/17 1:08,11/10/17 1:47,MAINTENANCE,0 days 00:39:00
A,11/10/17 1:47,11/10/17 2:56,RUN,0 days 01:09:00
A,11/10/17 2:56,11/10/17 3:41,STOP,0 days 00:45:00
A,11/10/17 3:41,11/10/17 7:33,RUN,0 days 03:52:00
B,11/10/17 7:29,11/10/17 14:54,STOP,0 days 07:25:00
A,11/10/17 7:33,11/23/17 14:44,STOP,13 days 07:11:00
C,11/10/17 10:17,11/10/17 17:07,STOP,0 days 06:50:00
B,11/10/17 14:54,11/10/17 15:53,MAINTENANCE,0 days 00:59:00
D,11/10/17 15:16,11/10/17 15:18,MAINTENANCE,0 days 00:02:00
D,11/10/17 15:18,11/20/17 13:40,RUN,9 days 22:22:00
B,11/10/17 15:53,11/12/17 12:18,RUN,1 days 20:25:00
E,11/10/17 16:57,11/10/17 17:08,STOP,0 days 00:11:00
C,11/10/17 17:07,11/10/17 17:52,MAINTENANCE,0 days 00:45:00
E,11/10/17 17:08,11/10/17 19:50,RUN,0 days 02:42:00
C,11/10/17 17:52,11/18/17 13:31,RUN,7 days 19:39:00
E,11/10/17 19:50,11/10/17 20:04,STOP,0 days 00:14:00

How data would look like in Excel

Expected Output:

For example, if I want to identify, count and determine the duration for any two of the machines concurrently in a "STOP" event, it should output:

A&B: Count = 1, Total Duration = 0 days 07:21:00
A&C: Count = 1, Total Duration = 0 days 06:50:00
A&E: Count = 2, Total Duration = 0 days 00:25:00
B&C: Count = 1, Total Duration = 0 days 04:37:00
C&E: Count = 1, Total Duration = 0 days 00:10:00
Other combinations: no concurrent "STOP" events

I am a bit new to Python and haven't been able to get to something close to what I want. However, I would like to have something that can run efficiently as I am dealing with a large number of machines and events.

Thank you very much in advance!

  • Welcome to Stack Overflow. You will need to include your attempted solution and ask a specific and detailed question about the problem you are having. This site is not a code-writing or a tutorial service. Please see [how to ask](https://stackoverflow.com/help/how-to-ask). – AlexK Mar 24 '23 at 19:12

1 Answers1

0

It's usually helpful if you also share your attempts, as we don't know what part(s) you're having issues with.

Determine Whether Two Date Ranges Overlap explains the logic to check for overlaps.

It also explains that the duration length is the minimum of the 4 (end - start) computations.

>>> df.head()
   index MachineID     Event_StartTime       Event_EndTime   Event_Type         Duration
0      0         A 2017-11-10 01:08:00 2017-11-10 01:47:00  MAINTENANCE  0 days 00:39:00
1      1         A 2017-11-10 01:47:00 2017-11-10 02:56:00          RUN  0 days 01:09:00
2      2         A 2017-11-10 02:56:00 2017-11-10 03:41:00         STOP  0 days 00:45:00
3      3         A 2017-11-10 03:41:00 2017-11-10 07:33:00          RUN  0 days 03:52:00
4      4         B 2017-11-10 07:29:00 2017-11-10 14:54:00         STOP  0 days 07:25:00

I personally find it easier to use SQL for this type of task.

Here's an example using duckdb which can be used directly - or it can read from dataframes.

Note: .reset_index() was used to create the index column in the dataframe which is used within the query for de-duplication purposes.

import duckdb

duckdb.sql("""
with 
stop_events as (from df where Event_Type = 'STOP'),
concurrent_stop_events as (
   from 
      stop_events t1, 
      stop_events t2
   select
      t1.MachineID id1,
      t2.MachineID id2,
      least( 
         t1.Event_EndTime - t1.Event_StartTime,
         t1.Event_EndTime - t2.Event_StartTime,
         t2.Event_EndTime - t1.Event_StartTime,
         t2.Event_EndTime - t2.Event_StartTime
      ) duration
      where
         t1.index < t2.index
         and
         t1.MachineID != t2.MachineID
         and 
         ((t2.Event_EndTime is NULL) or t1.Event_StartTime <= t2.Event_EndTime)
         and
         ((t1.Event_EndTime is NULL) or t1.Event_EndTime >= t2.Event_StartTime)
)
from 
   concurrent_stop_events        
select 
   id1,
   id2,
   count(*) count,
   to_seconds(
      sum(epoch(duration::interval))::bigint
   ) duration
group by all
""").df()

The .df() converts the result back into a pandas dataframe:

  id1 id2  count        duration
0   B   A      1 0 days 07:21:00
1   B   C      1 0 days 04:37:00
2   A   C      1 0 days 06:50:00
3   A   E      2 0 days 00:25:00
4   C   E      1 0 days 00:10:00
jqurious
  • 9,953
  • 1
  • 4
  • 14