Finding time difference between the message status in python

Question

So, I want to calculate the time differences; and the file looks something like this

id	message_id	send_date	status
0	5f74b996a2b7e	2020-10-01 00:00:07	sent
1	5f74b996a2b7e	2020-10-01 00:00:09	delivered
2	5f74b99e85b3c	2020-10-02 02:00:14	sent
3	5f74b99e85b3c	2020-10-02 02:01:16	delivered
4	5f74b99e85b3c	2020-10-02 08:06:49	read
5	5f74b996a2b7e	2020-10-02 15:16:32	read
6	5f9d97ff1af9e	2020-10-14 13:45:43	sent
7	5f9d97ff1af9e	2020-10-14 13:45:45	delivered
8	5f9d97ff1af9e	2020-10-14 13:50:48	read
9	5f74b9a35b6c5	2020-10-16 19:01:19	sent
10	5f74b9a35b6c5	2020-10-16 19:01:25	deleted

Inside is id which increment, message_id is unique to each message, send_date is the time, status is the message status (it has 5 statuses; sent, delivered, read, failed, and deleted).

I wanted to calculate the time differences when the message was sent then delivered, if delivered then read.

I know something like this can be handy, but I wasn't sure how to assign it uniquely to each of the message_id

from datetime import datetime
s1 = '2020-10-14 13:45:45'
s2 = '2020-10-14 13:50:48' # for example
FMT = '%Y-%m-%d %H:%M:%S'
tdelta = datetime.strptime(s2, FMT) - datetime.strptime(s1, FMT)
print(tdelta)

Ref: https://stackoverflow.com/questions/3096953/how-to-calculate-the-time-interval-between-two-time-strings

The expected output would be,

	message_id	delivered_diff	read_diff	deleted_diff
0	5f74b996a2b7e	00:00:02	1 day, 15:16:23
1	5f74b99e85b3c	00:01:02	6:05:33
2	5f9d97ff1af9e	00:00:02	0:05:03
3	5f74b9a35b6c5			0:00:06

Can you share your code and the expected output? – Carlo Zanocco Mar 20 '21 at 09:44 — Carlo Zanocco, Mar 20 '21 at 09:44
I've included something to clarify. Thanks! – sadil Mar 20 '21 at 10:04 — sadil, Mar 20 '21 at 10:04

Carlo Zanocco · Accepted Answer · 2021-03-20T10:54:54.413

You can use pandas to do this and datetime. The code is commented to better understand and realized with python 3.8.

import datetime
import pandas as pd

def time_delta(a, b):
    return datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S') - datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S') # calculate the timedelta

def calculate_diff(val, first_status, second_status):
    if not val['status'].str.contains(first_status).any() or not val['status'].str.contains(second_status).any(): # Check if the status exist
        return ''

    a = val.loc[val['status'] == first_status, 'send_date'].values[0] # Get the first send_date value for the first status value
    b = val.loc[val['status'] == second_status, 'send_date'].values[0] # Get the first send_date value for the second status value

    return time_delta(a, b) # calculate the delta

df = pd.read_csv('test.csv', sep=';') # Load csv file with ; as separator
grouped = df.groupby('message_id') # Group by message ids

final_df = pd.DataFrame(columns=['message_id', 'delivered_diff', 'read_diff', 'deleted_diff']) # create empty result dataframe
for message_id, values in grouped: # calculate the results for each group
    delivered_diff = calculate_diff(values, 'sent', 'delivered') # calculate delivered_diff as delta between sent status and delivered status
    read_diff = calculate_diff(values, 'delivered', 'read') # calculate read_diff as delta between delivered status and read status
    deleted_diff = calculate_diff(values, 'sent', 'deleted') # calculate deleted_diff as delta between sent status and deleted status

    res = {
        'message_id': message_id,
        'delivered_diff': delivered_diff,
        'read_diff': read_diff,
        'deleted_diff': deleted_diff
    }
    # append the results
    final_df = final_df.append(res, ignore_index=True)

# print final result
print(final_df)

The result:

      message_id   delivered_diff        read_diff     deleted_diff
0  5f74b996a2b7e  0 days 00:00:02  1 days 15:16:23
1  5f74b99e85b3c  0 days 00:01:02  0 days 06:05:33
2  5f74b9a35b6c5                                    0 days 00:00:06
3  5f9d97ff1af9e  0 days 00:00:02  0 days 00:05:03

Hi @RahmanArmenzaria if this or any answer has solved your question please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this, also take a look to [this](https://meta.stackexchange.com/a/5235/315993) for more details. — Carlo Zanocco, Mar 20 '21 at 10:56

score 0 · Answer 2 · answered Mar 20 '21 at 12:11

import pandas as pd
from datetime import datetime, timedelta

final_dict = []
data = pd.read_csv('data.csv', names=['id','unique_id','time','status'])
data['time'] = pd.to_datetime(data['time'])
# data.info()
groupByUniqueId = data.groupby('unique_id') 
for name,group in groupByUniqueId:
  for row in group.iterrows():
        
    if row[1][3] == "sent":
      sent = row[1][2]
    if row[1][3] == "read":
      final_dict.append({row[1][1]: {"read": str(sent - row[1][2])}})
    elif row[1][3] == "delivered":
      final_dict.append({row[1][1]: {"delivered":str(sent - row[1][2])}})
    elif row[1][3] == "deleted":
      final_dict.append({row[1][1]: {"deleted":str(sent - row[1][2])}})

print(final_dict)

Data Sample for CSV

Finding time difference between the message status in python

2 Answers2