
I have a data frame with the following columns:

  • Transaction group – the superior (top-level) grouping
  • Application number – can differ within one transaction group
  • Created time – a datetime
  • Status – either 'application' or 'calculation'

I want to calculate the time between the Created time of the 'application' row and the 'calculation' row in each pair. The data doesn't have a 'pair' column, so to make it easier to understand I labeled the pairs A, B, C in order.

pairs   Transaction group   Application number  Created time    Status
A   221110000363    HB202211100902000100    2022-11-10 09:05    application
A   221110000363    HB202211100902000100    2022-11-10 13:42    calculation
B   221110000363    HB202211100902000200    2022-11-10 14:02    application
B   221110000363    HB202211100902000200    2022-11-10 14:07    calculation
C   221110000363    HB202211100902000200    2022-11-10 14:43    application
C   221110000363    HB202211100902000200    2022-11-10 15:03    calculation

I tried splitting it into two data frames: one with all the applications for the transaction group – df1:

pairs   Transaction group   Application number  start_time  Status
A   221110000363    HB202211100902000100    2022-11-10 09:05    application
B   221110000363    HB202211100902000200    2022-11-10 14:02    application
C   221110000363    HB202211100902000200    2022-11-10 14:43    application

and another with all calculations - df2:

pairs   Transaction group   Application number  stop_time   Status
A   221110000363    HB202211100902000100    2022-11-10 13:42    calculation
B   221110000363    HB202211100902000200    2022-11-10 14:07    calculation
C   221110000363    HB202211100902000200    2022-11-10 15:03    calculation

I renamed Created time to 'start_time' in df1 and to 'stop_time' in df2, but then I don't know how to merge these two data frames so that each row holds one pair with its two times – start and stop. It is also important that the matching is done within 'Transaction group' and 'Application number'.
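
For reference, a minimal sketch of that split/rename/merge approach (assuming the frame is called df, the 'pairs' labels from the tables above already exist, and 'Created time' can be parsed as datetime):

import pandas as pd

df['Created time'] = pd.to_datetime(df['Created time'])

# split by Status and rename 'Created time' as in the tables above
df1 = (df[df['Status'] == 'application']
         .rename(columns={'Created time': 'start_time'})
         .drop(columns='Status'))
df2 = (df[df['Status'] == 'calculation']
         .rename(columns={'Created time': 'stop_time'})
         .drop(columns='Status'))

# one row per pair, keyed by the pair label plus the two grouping columns
result = df1.merge(df2,
                   on=['pairs', 'Transaction group', 'Application number'],
                   how='left')
result['duration'] = result['stop_time'] - result['start_time']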

I'm looking for a result like this:

pairs   Transaction group   Application number  start_time  stop_time
A   221110000363    HB202211100902000100    2022-11-10 09:05    2022-11-10 13:42
B   221110000363    HB202211100902000200    2022-11-10 14:02    2022-11-10 14:07
C   221110000363    HB202211100902000200    2022-11-10 14:43    2022-11-10 15:03

EDIT:

Logic for the pairs:

Within one transaction group, for each application number I take the lowest datetime where 'Status' is 'application' and then the lowest datetime where 'Status' is 'calculation' – that is the first pair. The second pair is the second-lowest 'application' datetime together with the second-lowest 'calculation' datetime, and so on.
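
A rough sketch of that logic in pandas (assuming the frame is called df and 'Created time' is parsed as datetime): sort by time, then number the n-th 'application' and the n-th 'calculation' separately within each Transaction group / Application number, so matching numbers form a pair.

import pandas as pd

df['Created time'] = pd.to_datetime(df['Created time'])
df = df.sort_values(['Transaction group', 'Application number', 'Created time'])

# the n-th 'application' and the n-th 'calculation' inside each
# Transaction group / Application number get the same pair number
df['pairs'] = (df.groupby(['Transaction group', 'Application number', 'Status'])
                 .cumcount())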

EDIT2

Data:

Application number  Created time        Transaction group   Status
HB202211100902000100    2022-11-10 09:05    221110000363    application
HB202211100902000100    2022-11-10 13:42    221110000363    calculation
HB202211100902000100    2022-11-10 14:02    221110000363    application
HB202211100902000100    2022-11-10 14:07    221110000363    calculation
HB202211100902000100    2022-11-10 14:43    221110000363    application
HB202211100902000100    2022-11-10 15:03    221110000363    calculation
HB202211100902000100    2022-11-10 16:24    221110000363    application
HB202205152239000200    2022-05-15 22:53    220515000252    application
HB202205152239000200    2022-05-16 14:57    220515000252    calculation
HB202205152253000100    2022-05-15 22:56    220515000252    application
HB202205152253000100    2022-05-16 14:56    220515000252    calculation
HB202205152257000100    2022-05-15 22:57    220515000252    application
HB202205152257000100    2022-05-16 14:56    220515000252    calculation
HB202205152259000100    2022-05-15 23:00    220515000252    application
HB202205152259000100    2022-05-16 14:56    220515000252    calculation
HB202205152301000100    2022-05-15 23:02    220515000252    application
HB202205152301000100    2022-05-16 14:56    220515000252    calculation
HB202205152302000100    2022-05-15 23:04    220515000252    application
HB202205152302000100    2022-05-16 14:55    220515000252    calculation
HB202205152305000100    2022-05-15 23:07    220515000252    application
HB202205152305000100    2022-05-16 14:55    220515000252    calculation
HB202205152307000100    2022-05-15 23:09    220515000252    application
HB202205152307000100    2022-05-16 14:54    220515000252    calculation
HB202205152312000100    2022-05-15 23:13    220515000252    application
HB202205152312000100    2022-05-16 14:54    220515000252    calculation
HB202205152313000100    2022-05-15 23:17    220515000252    application
HB202205152313000100    2022-05-16 14:54    220515000252    calculation

So I used this:

df['pairs'] = df.groupby(['Status']).cumcount()

And my data now looks like:

Application number  Created time    Transaction group   Status  pairs
0   HB202211100902000100    2022-11-10 09:05:28     221110000363    application     0
1   HB202211100902000100    2022-11-10 13:42:44     221110000363    calculation     0
2   HB202211100902000100    2022-11-10 14:02:43     221110000363    application     1
3   HB202211100902000100    2022-11-10 14:07:16     221110000363    calculation     1
4   HB202211100902000100    2022-11-10 14:43:54     221110000363    application     2
5   HB202211100902000100    2022-11-10 15:03:06     221110000363    calculation     2
6   HB202211100902000100    2022-11-10 16:24:26     221110000363    application     3
7   HB202205152239000200    2022-05-15 22:53:18     220515000252    application     4
8   HB202205152239000200    2022-05-16 14:57:14     220515000252    calculation     3
9   HB202205152253000100    2022-05-15 22:56:11     220515000252    application     5
10  HB202205152253000100    2022-05-16 14:56:55     220515000252    calculation     4
11  HB202205152257000100    2022-05-15 22:57:58     220515000252    application     6
12  HB202205152257000100    2022-05-16 14:56:47     220515000252    calculation     5

So the problem starts from the 3rd pair: because cumcount is grouped only by 'Status', rows from different transaction groups and application numbers get numbered into the same pair.

After using pivot:

df = (df.pivot(index=['pairs', 'Transaction group', 'Application number'], columns='Status', values='Created time'))

The data looks like this:

                                 Status     application             calculation
pairs   Transaction group   Application number      
0   221110000363    HB202211100902000100    2022-11-10 09:05:28     2022-11-10 13:42:44
1   221110000363    HB202211100902000100    2022-11-10 14:02:43     2022-11-10 14:07:16
2   221110000363    HB202211100902000100    2022-11-10 14:43:54     2022-11-10 15:03:06
3   220515000252    HB202205152239000200    NaT                     2022-05-16 14:57:14
        221110000363    HB202211100902000100    2022-11-10 16:24:26     NaT
4   220515000252    HB202205152239000200    2022-05-15 22:53:18     NaT
                        HB202205152253000100    NaT                     2022-05-16 14:56:55
5   220515000252    HB202205152253000100    2022-05-15 22:56:11     NaT
                        HB202205152257000100    NaT                     2022-05-16 14:56:47

So, in my opinion, the most important thing is to build the pairs correctly.
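
A sketch of what that could look like, assuming the pairs are numbered within each Transaction group / Application number (as in the EDIT above) instead of per Status alone:

df['pairs'] = (df.groupby(['Transaction group', 'Application number', 'Status'])
                 .cumcount())

out = (df.pivot(index=['pairs', 'Transaction group', 'Application number'],
                columns='Status', values='Created time')
         .rename(columns={'application': 'start_time', 'calculation': 'stop_time'})
         .reset_index())
out['duration'] = out['stop_time'] - out['start_time']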

  • It is unclear what the logic to split into pairs is. The `application number`? But B and C have the same number. Sort by time and take them in pairs? – buran Nov 30 '22 at 07:55
  • Logic to split in pairs: Transaction group > Application number > datetime. – zukowski2012 Nov 30 '22 at 08:10
  • _Transaction group > Application number > datetime_ - that is the original data, because all datetimes are unique – buran Nov 30 '22 at 08:18
  • Logic for the pairs: in one transaction group, for each application number I look for the first (lowest) datetime where 'Status' is 'application' and then the first (lowest) datetime where 'Status' is 'calculation' – this is the first pair. The second pair is the second-lowest 'application' datetime and the second-lowest 'calculation' datetime. – zukowski2012 Nov 30 '22 at 09:50

1 Answer


Use a pivot and rename:

df['pairs'] = df.groupby('Status').cumcount()

(df.pivot(index=['pairs', 'Transaction group', 'Application number'],
         columns='Status', values='Created time')
   .rename(columns={'application': 'start_time', 'calculation': 'end_time'})
   .reset_index()
)

Output:

Status pairs  Transaction group    Application number          start_time            end_time
0          A       221110000363  HB202211100902000100 2022-11-10 09:05:00 2022-11-10 13:42:00
1          B       221110000363  HB202211100902000200 2022-11-10 14:02:00 2022-11-10 14:07:00
2          C       221110000363  HB202211100902000200 2022-11-10 14:43:00 2022-11-10 15:03:00
mozway
  • @buran see update, classical pivot without index ;) – mozway Nov 30 '22 at 07:57
  • @mozway thank you! How do I handle the situation when a transaction group doesn't have a complete pair – it ends with status 'application'? That code pairs it with a value from another transaction group, and I don't want that. I only want to analyze pairs within a transaction group. – zukowski2012 Nov 30 '22 at 13:41
  • You can `dropna()` before `reset_index` – mozway Nov 30 '22 at 13:42
  • The problem is with making the pairs. In one pair I get two different transaction groups, two different application numbers, one row with the start date and a second with the end date. Using 'dropna' deletes all such rows. – zukowski2012 Nov 30 '22 at 14:22
  • If you restrict the `dropna` this shouldn't happen, can you provide an example? – mozway Nov 30 '22 at 14:26
  • Please look at the EDIT2 – zukowski2012 Nov 30 '22 at 15:03
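
Following mozway's suggestion to restrict the dropna, a possible sketch (assuming the pairs are built per Transaction group / Application number as sketched above, so the only NaT rows left are genuinely unmatched ones such as the final 16:24 application):

out = (df.pivot(index=['pairs', 'Transaction group', 'Application number'],
                columns='Status', values='Created time')
         .rename(columns={'application': 'start_time', 'calculation': 'end_time'})
         .dropna(subset=['start_time', 'end_time'])  # drop unmatched rows only
         .reset_index())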