
I have the following dummy dataframe:

import pandas as pd
import numpy as np

def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)

    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))

data = {'Timestamp': random_dates('2018-01-01', '2018-01-02', 21, 'H', seed=[3, 1415]), 
        'Group': [1,1,1,1,1,1,1,1,1,1,1,1,
                 2,2,2,2,2,2,2,2,2],
        
        'Event': ['A','A','A','B','A','A','A','B','A','A','A','B',
                'A','A','B','A','A','B','A','A','B']}

df = pd.DataFrame(data, columns = ['Timestamp', 'Group', 'Event'])
print(df)

             Timestamp  Group Event
0  2018-01-01 00:00:00      1    A
1  2018-01-01 01:00:00      1    A
2  2018-01-01 03:00:00      1    A
3  2018-01-01 04:00:00      1    B
4  2018-01-01 05:00:00      1    A
5  2018-01-01 06:00:00      1    A
6  2018-01-01 07:00:00      1    A
7  2018-01-01 08:00:00      1    B
8  2018-01-01 09:00:00      1    A
9  2018-01-01 12:00:00      1    A
10 2018-01-01 13:00:00      1    A
11 2018-01-01 14:00:00      1    B
12 2018-01-01 15:00:00      2    A
13 2018-01-01 17:00:00      2    A
14 2018-01-01 18:00:00      2    B
15 2018-01-01 19:00:00      2    A
16 2018-01-01 20:00:00      2    A
17 2018-01-01 21:00:00      2    B
18 2018-01-01 22:00:00      2    A
19 2018-01-01 23:00:00      2    A
20 2018-01-02 00:00:00      2    B

I want a dynamic rolling count of the 'Event' column for each 'Group'. As can be seen, df['Group']==1 has the repeating event pattern

A, A, A, B

i.e. runs of length 3 and 1, while df['Group']==2 has the pattern

A, A, B

i.e. runs of length 2 and 1. Ideally I would like to have:

Group Event Sequence 
1      A      3      
1      B      1      
1      A      3      
1      B      1      
1      A      3      
1      B      1      
2      A      2      
2      B      1      
2      A      2      
2      B      1      
2      A      2      
2      B      1      

This would let me plot Sequence in order for monitoring. By 'dynamic' I mean that, as demonstrated above, the run lengths of Event change over time, even within a Group! For example df['Group']==1 could just as well show 3, 1, 3, 1, 2, 1.
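
To make the counting I am after concrete, here is the same run-length idea expressed with plain Python's itertools.groupby for Group 1 (just an illustration of what I mean by Sequence, not the pandas solution I am looking for):

from itertools import groupby

# Collapse the consecutive identical events of Group 1 into (event, run length) pairs
events_g1 = list(df.loc[df['Group'] == 1, 'Event'])
runs = [(event, len(list(run))) for event, run in groupby(events_g1)]
print(runs)  # [('A', 3), ('B', 1), ('A', 3), ('B', 1), ('A', 3), ('B', 1)]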

It would also be great to have the elapsed time of each sequence of events. I would calculate it as the difference between the Timestamp at which the next sequence starts and the first Timestamp of the current sequence, per group. We would then have (elapsed time in hours):

Group Event Sequence ElapsedTime
1      A      3      4
1      B      1      1
1      A      3      3
1      B      1      1
1      A      3      5
1      B      1      1
2      A      2      3
2      B      1      1
2      A      2      2
2      B      1      1
2      A      2      2
2      B      1      None

Here ElapsedTime for the first row (the first sequence of 'Event' A in Group 1) is calculated as:

df[df['Group']==1]['Timestamp'].iloc[3] - df[df['Group']==1]['Timestamp'].iloc[0]

i.e. 4 hours, and for the second row (the first sequence of 'Event' B in Group 1) as:

df[df['Group']==1]['Timestamp'].iloc[4] - df[df['Group']==1]['Timestamp'].iloc[3]

i.e. 1 hour.
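
For completeness, the same arithmetic for all Group 1 sequences (the hard-coded run start positions are only for this illustration):

# Positions (within Group 1) at which each sequence starts: A, B, A, B, A, B
g1 = df.loc[df['Group'] == 1, 'Timestamp']
run_starts = [0, 3, 4, 7, 8, 11]
for start, nxt in zip(run_starts, run_starts[1:]):
    print(g1.iloc[nxt] - g1.iloc[start])
# prints 0 days 04:00:00, 01:00:00, 03:00:00, 01:00:00, 05:00:00 -- the last 'B'
# run has no following row in the data, hence None in the table above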

I have tried rolling counts, resample, groupby.cumcount(), and so on in pandas, and none of them returns the result I am after. I am certain some combination of these methods in a groupby fashion should do the job (at least for the first scenario), but I have spent enough time trying and searching with no success so far. It is beyond my current knowledge of pandas!

Once again, your time and expertise are very much appreciated.

1 Answer


OK, I did not give up! It turned out I needed another 1-2 hours to find a solution (largely inspired by this answer by @DSM), and, hell, I learned a lot along the way. Here it is, for anyone who comes across this kind of data transformation, including the elapsed time:

# Timestamp of the next row; it marks where the current run "ends"
df['Lag'] = df['Timestamp'].shift(-1)
# Size of each run of consecutive identical events
df['Seq'] = df["Event"].groupby((df.Event != df.Event.shift()).cumsum()).transform('size')
# Id of each run, incremented whenever Event changes from one row to the next
df['SeqID'] = (df.Event != df.Event.shift()).cumsum()

# Collapse each run to one row, keeping its first Timestamp
df_grp = df.groupby(['Group','SeqID']).first().reset_index()
# Elapsed time of a run: the row right after the run ends minus the run's first row
df_grp['Elapsed(min)'] = (df.groupby(['Group','SeqID'])['Lag'].last()
                          - df.groupby(['Group','SeqID'])['Timestamp'].first()
                          ).reset_index()[0] / np.timedelta64(1, 'm')

df_grp = df_grp.drop(['Timestamp','Lag'], axis=1)
print(df_grp)


    Group  SeqID Event  Seq  Elapsed(min)
0       1      1     A    3         240.0
1       1      2     B    1          60.0
2       1      3     A    3         180.0
3       1      4     B    1          60.0
4       1      5     A    3         300.0
5       1      6     B    1          60.0
6       2      7     A    2         180.0
7       2      8     B    1          60.0
8       2      9     A    2         120.0
9       2     10     B    1          60.0
10      2     11     A    2         120.0
11      2     12     B    1           NaN
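
One caveat worth noting: the run id (df.Event != df.Event.shift()).cumsum() only starts a new run when Event changes, which works here because Group 1 happens to end with 'B' while Group 2 starts with 'A'. If two adjacent groups could end and start with the same Event, a variant that also breaks runs at group boundaries might be safer (same idea, just one extra condition):

# Start a new run whenever either the Event or the Group changes,
# so a run can never spill across group boundaries
change = (df['Event'] != df['Event'].shift()) | (df['Group'] != df['Group'].shift())
df['SeqID'] = change.cumsum()
df['Seq'] = df.groupby('SeqID')['Event'].transform('size')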

Happy Pandas-ing! ;)
