3

I have a dataset with 3 columns: CustomerID, Order#, and Time. In this dataset you can see that some customers placed more than one order at different times. For example, the customer with ID 4 placed order 4 at 2:16pm and then order 5 at 6:17pm. What I need to do is identify the time between the first and last order.

    CustomerID  Order#                Time
0            1       1 2021-06-16 19:11:21
1            2       2 2021-06-17 19:24:19
2            3       3 2021-06-16 19:22:42
3            4       4 2021-06-14 14:16:50
4            4       5 2021-06-14 18:17:48
5            5       6 2021-06-16 21:11:06
6            6       7 2021-06-15 17:02:19
7            7       8 2021-06-13 21:01:36
8            8       9 2021-06-16 18:40:14
9            8      10 2021-06-14 14:46:59
10           8      11 2021-06-14 14:46:59
11           9      12 2021-06-17 09:10:15
12          10      13 2021-06-15 02:49:5

In Excel I would sort CustomerID from lowest to highest, then create a column with a running count of occurrences of each ID, and then calculate the time difference between the first and last occurrence of the same ID.
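In pandas, I imagine the running-count helper column would look something like this (just my rough guess using groupby().cumcount(); the Occurrence column name is mine, and I have not gotten further than this):

import pandas as pd

orders = pd.read_csv("data/Orders.csv")
orders = orders.sort_values('CustomerID')
# running count of each CustomerID, like the helper column I would build in Excel
orders['Occurrence'] = orders.groupby('CustomerID').cumcount() + 1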

In Python, what I actually tried is this:

import pandas as pd
orders = pd.read_csv("data/Orders.csv")
occur = orders.groupby(['CustomerID']).size()  # number of orders per customer, but the Time column is gone

But the problem with this is that it groups by CustomerID and I lose the time dimension. Ultimately, I need to create an additional column where I calculate the time between the first occurrence and the last occurrence for the same CustomerID.

So the output would look like this: [screenshot of the expected output]

Is there any way to do this in Python?

Slavisha84
  • You are expected to add your data as text so that it can be copied; take a look at [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – ThePyGuy Jun 20 '21 at 01:55
  • So do you want the output only for the records that have multiple orders? The difference between first and last order makes no sense for non-repeating customers. – ThePyGuy Jun 20 '21 at 02:03
  • Yes, so just for the customers who had more than one order. For those who had only one it should be blank. – Slavisha84 Jun 20 '21 at 02:04
  • What do you mean by "it should be blank"? Blank like NaT, or 0 seconds? If there are three entries, should the time difference go in all 3 rows? Something else? Expected output is really helpful in these situations. – Henry Ecker Jun 20 '21 at 02:09

4 Answers

2

Group by CustomerID, aggregate the minimum and maximum of the Time column, take the difference, and finally merge it back onto the original dataframe.

df['Time'] = pd.to_datetime(df['Time'])
minMax = df.groupby('CustomerID').agg(minTime=('Time', 'min'), maxTime=('Time', 'max'))
minMax['Diff'] = minMax['maxTime'] - minMax['minTime']
df = df.merge(minMax[['Diff']], on='CustomerID')

OUTPUT:

    CustomerID  Order#                Time            Diff
0            1       1 2021-06-16 19:11:21 0 days 00:00:00
1            2       2 2021-06-17 19:24:19 0 days 00:00:00
2            3       3 2021-06-16 19:22:42 0 days 00:00:00
3            4       4 2021-06-14 14:16:50 0 days 04:00:58
4            4       5 2021-06-14 18:17:48 0 days 04:00:58
5            5       6 2021-06-16 21:11:06 0 days 00:00:00
6            6       7 2021-06-15 17:02:19 0 days 00:00:00
7            7       8 2021-06-13 21:01:36 0 days 00:00:00
8            8       9 2021-06-16 18:40:14 2 days 03:53:15
9            8      10 2021-06-14 14:46:59 2 days 03:53:15
10           8      11 2021-06-14 14:46:59 2 days 03:53:15
11           9      12 2021-06-17 09:10:15 0 days 00:00:00
12          10      13 2021-06-15 02:49:05 0 days 00:00:00
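If a plain number is more convenient than a Timedelta further down the line, the Diff column can be converted, for example to hours (a small follow-up sketch; DiffHours is just an illustrative name, not part of the answer above):

# Timedelta -> float hours via the .dt accessor on timedelta Series
df['DiffHours'] = df['Diff'].dt.total_seconds() / 3600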
ThePyGuy
2

Another option: groupby max minus groupby min on Time, broadcast back to each row with transform:

df['Time'] = pd.to_datetime(df['Time'])

g = df.groupby('CustomerID')['Time']
df['duration'] = g.transform('max') - g.transform('min')

Then use Series.where to clean up the values:

df['duration'] = df['duration'].where(
    df['duration'] > pd.Timedelta(seconds=0),
    pd.NaT
)
    CustomerID  Order#                Time        duration
0            1       1 2021-06-16 19:11:21             NaT
1            2       2 2021-06-17 19:24:19             NaT
2            3       3 2021-06-16 19:22:42             NaT
3            4       4 2021-06-14 14:16:50 0 days 04:00:58
4            4       5 2021-06-14 18:17:48 0 days 04:00:58
5            5       6 2021-06-16 21:11:06             NaT
6            6       7 2021-06-15 17:02:19             NaT
7            7       8 2021-06-13 21:01:36             NaT
8            8       9 2021-06-16 18:40:14 2 days 03:53:15
9            8      10 2021-06-14 14:46:59 2 days 03:53:15
10           8      11 2021-06-14 14:46:59 2 days 03:53:15
11           9      12 2021-06-17 09:10:15             NaT
12          10      13 2021-06-15 02:49:05             NaT

Or with blank space:

df['duration'] = df['duration'].where(
    df['duration'] > pd.Timedelta(seconds=0),
    ''
)
    CustomerID  Order#                Time         duration
0            1       1 2021-06-16 19:11:21                 
1            2       2 2021-06-17 19:24:19                 
2            3       3 2021-06-16 19:22:42                 
3            4       4 2021-06-14 14:16:50  0 days 04:00:58
4            4       5 2021-06-14 18:17:48  0 days 04:00:58
5            5       6 2021-06-16 21:11:06                 
6            6       7 2021-06-15 17:02:19                 
7            7       8 2021-06-13 21:01:36                 
8            8       9 2021-06-16 18:40:14  2 days 03:53:15
9            8      10 2021-06-14 14:46:59  2 days 03:53:15
10           8      11 2021-06-14 14:46:59  2 days 03:53:15
11           9      12 2021-06-17 09:10:15                 
12          10      13 2021-06-15 02:49:05                 
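A side note, not part of the original answer: the > pd.Timedelta(seconds=0) condition would also blank a repeat customer whose orders all share one timestamp. If the goal is strictly "blank only for single-order customers", the group size can serve as the condition instead (a sketch):

# blank only where a customer has a single order, regardless of timestamps;
# used in place of the Timedelta comparison above
multi_order = df.groupby('CustomerID')['Time'].transform('size') > 1
df['duration'] = df['duration'].where(multi_order, '')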

DataFrame and Imports:

import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 8, 9, 10],
    'Order#': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
    'Time': ['2021-06-16 19:11:21', '2021-06-17 19:24:19',
             '2021-06-16 19:22:42', '2021-06-14 14:16:50',
             '2021-06-14 18:17:48', '2021-06-16 21:11:06',
             '2021-06-15 17:02:19', '2021-06-13 21:01:36',
             '2021-06-16 18:40:14', '2021-06-14 14:46:59',
             '2021-06-14 14:46:59', '2021-06-17 09:10:15', 
             '2021-06-15 02:49:5']
})
Henry Ecker
1

You can use a groupby on the CustomerID column as you tried, then calculate the difference between the min and max of the Time column for each unique CustomerID, and merge this back into the original DataFrame.

import pandas as pd

## read in the data (pasted from the question via the clipboard)
df = pd.read_clipboard()

## the paste comes in with a two-level index: pull CustomerID out of the second
## index level, and stitch the date (which lands in Order#) back onto the time
df.reset_index(inplace=True)
df['CustomerID'] = df['level_1']
df = df.drop(columns=['level_0', 'level_1'])
df['Time'] = pd.to_datetime(df['Order#'] + ' ' + df['Time'])
df['Order#'] = df.index.values + 1

## time between first and last order for each customer, merged back onto df
time_diffs = (df.groupby('CustomerID')
                .apply(lambda x: x.Time.max() - x.Time.min())
                .reset_index()
                .rename(columns={0: 'time_diffs'}))

df = df.merge(time_diffs, on='CustomerID')

Result of df:

    CustomerID  Order#                Time      time_diffs
0            1       1 2021-06-16 19:11:21 0 days 00:00:00
1            2       2 2021-06-17 19:24:19 0 days 00:00:00
2            3       3 2021-06-16 19:22:42 0 days 00:00:00
3            4       4 2021-06-14 14:16:50 0 days 04:00:58
4            4       5 2021-06-14 18:17:48 0 days 04:00:58
5            5       6 2021-06-16 21:11:06 0 days 00:00:00
6            6       7 2021-06-15 17:02:19 0 days 00:00:00
7            7       8 2021-06-13 21:01:36 0 days 00:00:00
8            8       9 2021-06-16 18:40:14 2 days 03:53:15
9            8      10 2021-06-14 14:46:59 2 days 03:53:15
10           8      11 2021-06-14 14:46:59 2 days 03:53:15
11           9      12 2021-06-17 09:10:15 0 days 00:00:00
12          10      13 2021-06-15 02:49:05 0 days 00:00:00
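If the data is loaded straight from the CSV mentioned in the question rather than from the clipboard, the read-in portion reduces to the sketch below (data/Orders.csv is the path from the question):

import pandas as pd

# read the CSV and parse the timestamps
df = pd.read_csv("data/Orders.csv")
df['Time'] = pd.to_datetime(df['Time'])

# same per-customer max - min difference, merged back onto every row
time_diffs = (df.groupby('CustomerID')['Time']
                .apply(lambda s: s.max() - s.min())
                .rename('time_diffs')
                .reset_index())
df = df.merge(time_diffs, on='CustomerID')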
Derek O
1

Try with np.ptp (peak-to-peak, i.e. max minus min) applied through transform; Time needs to be datetime already:

import numpy as np

# np.ptp (max - min) per group; mask blanks zero diffs and repeated values
df['new'] = df.groupby('CustomerID')['Time'].transform(np.ptp).\
                mask(lambda x: x.eq('00:00:00') | x.duplicated(), '')
df
    CustomerID  Order#                Time              new
0            1       1 2021-06-16 19:11:21                 
1            2       2 2021-06-17 19:24:19                 
2            3       3 2021-06-16 19:22:42                 
3            4       4 2021-06-14 14:16:50  0 days 04:00:58
4            4       5 2021-06-14 18:17:48                 
5            5       6 2021-06-16 21:11:06                 
6            6       7 2021-06-15 17:02:19                 
7            7       8 2021-06-13 21:01:36                 
8            8       9 2021-06-16 18:40:14  2 days 03:53:15
9            8      10 2021-06-14 14:46:59                 
10           8      11 2021-06-14 14:46:59                 
11           9      12 2021-06-17 09:10:15                 
12          10      13 2021-06-15 02:49:05 
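An equivalent sketch without numpy, in case pulling in np.ptp alone feels unnecessary (not from the original answer; it assumes Time is already datetime and produces the same result):

# same peak-to-peak (max - min) per customer, using a plain lambda
df['new'] = (df.groupby('CustomerID')['Time']
               .transform(lambda s: s.max() - s.min())
               .mask(lambda x: x.eq(pd.Timedelta(0)) | x.duplicated(), ''))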
BENY