1

Currently, I am pretty stuck on a starting point for a problem I am trying to figure out. I have a dataframe that contains 4 columns. I want to try and find the overlapping times by day and id. for example, my df looks like the following:

+------+--------------+-----------------------+----------------------+
| id   |   date       |  time_start           |  end_time            |
+--------------------------------------------------------------------+
| 123  |   2019-11-10 |  2019-11-10 08:00:00  |  2019-11-10 08:30:00 |
|      |              |                       |                      |
| 123  |   2019-11-10 |  2019-11-10 08:15:00  |  2019-11-10 08:30:00 |
|      |              |                       |                      |
| 123  |   2019-11-10 |  2019-11-10 08:25:00  |  2019-11-10 08:45:00 |
|      |              |                       |                      |
| 123  |   2019-11-11 |  2019-11-11 08:00:00  |  2019-11-11 08:30:00 |
|      |              |                       |                      |
| 123  |   2019-11-11 |  2019-11-11 08:30:00  |  2019-11-11 09:00:00 |
+------+--------------+-----------------------+----------------------+

import pandas as pd 

data = {'id':['123', '123', '123', '123','123'], 'date':['2019-11-10', '2019-11-10', '2019-11-10', '2019-11-11', '2019-11-11'],
        'time_start':['2019-11-10 08:00:00', '2019-11-10 08:15:00', '2019-11-10 08:25:00', '2019-11-11 08:00:00', '2019-11-11 08:30:00'],
        'end_time':['2019-11-10 08:30:00','2019-11-10 08:30:00','2019-11-10 08:45:00','2019-11-11 08:30:00','2019-11-11 09:00:00']}

df = pd.DataFrame(data)

,id,date,time_start,end_time
0,123,2019-11-10,2019-11-10 08:00:00,2019-11-10 08:30:00
1,123,2019-11-10,2019-11-10 08:15:00,2019-11-10 08:30:00
2,123,2019-11-10,2019-11-10 08:25:00,2019-11-10 08:45:00
3,123,2019-11-11,2019-11-11 08:00:00,2019-11-11 08:30:00
4,123,2019-11-11,2019-11-11 08:30:00,2019-11-11 09:00:00

I would like to see a result similar to the following:

+----+------------+----------------------+---------------------+---------------+-------------------------+-----------------+
|id  | date       |  time_start          | time_end            | overlap_count |  total_minutes_recorded |   actual_minutes|
+--------------------------------------------------------------------------------------------------------------------------+
|123 | 2019-11-10 |  2019-11-10 08:00:00 | 2019-11-10 08:45:00 | 3             |  65                     |   45            |
|    |            |                      |                     |               |                         |                 |
|123 | 2019-11-11 |  2019-11-11 08:00:00 | 2019-11-11 09:00:00 | 0             |  60                     |   60            |
+----+------------+----------------------+---------------------+---------------+-------------------------+-----------------+

I looked at other answers that start to give me insight on how to solve this problem, for example:

Pandas: Count time interval intersections over a group by

Most of these answers are just giving me a count of overlapping times, and it is taking a long awhile to compute. Are there any tips on how to start to solve this problem

BeRT2me
  • 12,699
  • 2
  • 13
  • 31
ghawes
  • 193
  • 1
  • 11
  • you want to get values for same start time with 1 day of difference? – sygneto Aug 13 '20 at 06:51
  • 1
    Please do not post data in markdown tables, they are not easily reproducible. Please see [How to provide a reproducible copy of your DataFrame using `df.head(10).to_clipboard()`](https://stackoverflow.com/questions/52413246), then **[edit] your question**, and paste the clipboard into a code block. – Trenton McKinney Aug 13 '20 at 06:51
  • You have shown wrong data in expected output: `time_end` if different than in main df – sygneto Aug 13 '20 at 07:50
  • @TrentonMcKinney thank you. I will do that – ghawes Aug 13 '20 at 14:15
  • Thank you @sygneto I have now corrected those dates – ghawes Aug 13 '20 at 14:41

2 Answers2

1

I dont know how you get overlap_count and values of total_minutes_recorded oin first row, its mistake i guess

df= pd.DataFrame({
    'id':[123,123,123,123,123],
    'date':['2019-11-10','2019-11-10','2019-11-10','2019-11-11','2019-11-11'],
    'time_start':['2019-11-10 08:00:00','2019-11-10 08:15:00','2019-11-10 08:25:00',
                  '2019-11-11 08:00:00','2019-11-11 08:30:00'],
    'end_time':['2019-11-10 08:30:00','2019-11-10 08:30:00','2019-11-10 08:45:00',
                '2019-11-11 08:30:00','2019-11-11 09:00:00']
})
df['date'] = pd.to_datetime(df['date'])
df['time_start'] = pd.to_datetime(df['time_start'])
df['end_time'] = pd.to_datetime(df['end_time'])
df_temp=df
df = pd.merge(df,df_temp,on='id')
df=df[
    ((df.time_start_x - df.time_start_y) == np.timedelta64(1, 'D'))
]
df_temp=df[['id','date_x','time_start_x','end_time_x']]
df_temp1 = df[['id','date_y','time_start_y','end_time_y']]
df_temp=df_temp.rename(columns={"date_x": "date", "time_start_x": "time_start",
                                "end_time_x":"end_time"})
df_temp1=df_temp1.rename(columns={"date_y": "date", "time_start_y": "time_start",
                                "end_time_y":"end_time"})

df=pd.concat([df_temp,df_temp1])
df=df[['id','date','time_start','end_time']].sort_values(by='date')

df['total_minutes_recorded'] = df['end_time']-df['time_start']

print(df)
     id       date          time_start            end_time total_minutes_recorded
15  123 2019-11-10 2019-11-10 08:00:00 2019-11-10 08:30:00               00:30:00
15  123 2019-11-11 2019-11-11 08:00:00 2019-11-11 08:30:00               00:30:00
sygneto
  • 1,761
  • 1
  • 13
  • 26
  • The last value is supposed to have an end_time of 9:00:00. As shown, this has a value of 8:30:00. Something's not right. – macetw Oct 06 '21 at 22:18
0

Use a groupby to group by dates and then define a function that takes each date as a dataframe. I'll give you get_minutes_recorded. get_overlap_counts is a bit more involved - you can tackle it by keeping a vector of 0s per index, loop through all your dates, i, and if end_date[i] of row i is between start and end of row n, make the vector[n] = 1.

def function(sub_df):
    overlap_count = get_overlap_count(sub_df)
    total_minutes_recorded = get_minutes_recorded(sub_df)
    return overlap_count, total_minutes_recorded

def get_overlap_counts(df):
    pass

def get_minutes_recorded(df):
    return (df[end_time] - df[start_time]).dt.seconds.sum()
    

df.groupby('date').apply(function)
Landmaster
  • 1,043
  • 2
  • 13
  • 21