I have the following dummy dataframe:
import pandas as pd
import numpy as np
def random_dates(start, end, n, freq, seed=None):
if seed is not None:
np.random.seed(seed)
dr = pd.date_range(start, end, freq=freq)
return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))
data = {'Timestamp': random_dates('2018-01-01', '2018-01-02', 21, 'H', seed=[3, 1415]),
'Group': [1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2],
'Event': ['A','A','A','B','A','A','A','B','A','A','A','B',
'A','A','B','A','A','B','A','A','B']}
df = pd.DataFrame(data, columns = ['Timestamp', 'Group', 'Event'])
print(df)
Timestamp Group Event
0 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 1 A
2 2018-01-01 03:00:00 1 A
3 2018-01-01 04:00:00 1 B
4 2018-01-01 05:00:00 1 A
5 2018-01-01 06:00:00 1 A
6 2018-01-01 07:00:00 1 A
7 2018-01-01 08:00:00 1 B
8 2018-01-01 09:00:00 1 A
9 2018-01-01 12:00:00 1 A
10 2018-01-01 13:00:00 1 A
11 2018-01-01 14:00:00 1 B
12 2018-01-01 15:00:00 2 A
13 2018-01-01 17:00:00 2 A
14 2018-01-01 18:00:00 2 B
15 2018-01-01 19:00:00 2 A
16 2018-01-01 20:00:00 2 A
17 2018-01-01 21:00:00 2 B
18 2018-01-01 22:00:00 2 A
19 2018-01-01 23:00:00 2 A
20 2018-01-02 00:00:00 2 B
I want the dynamic rolling count of column 'Event' for each 'Group'. As it can be seen for example df['Group']==1
has the sequence of events:
A, A, A, B
where Event sequence occurs every third, thus has a sequence of 3, 1. While df['Group']==2
is:
A, A, B
where Event sequence occurs every second, thus has a sequence of 2, 1. Ideally I would to have:
Group Event Sequence
1 A 3
1 B 1
1 A 3
1 B 1
1 A 3
1 B 1
2 A 2
2 B 1
2 A 2
2 B 1
2 A 2
2 B 1
So that I can plot Sequence in order for monitoring. By 'dynamic', as demonstrated, the occurrence of Event is changing, even within a Group! For example df['Group']==1
could see 3, 1, 3, 1, 2, 1 as well.
That would be also great to have the elapsed time calculated with each sequence of events. This can be calculated as the difference between the last and first Timestamp of Event of the each sequence for each group, we would have:
Group Event Sequence ElapsedTime
1 A 3 4
1 B 1 1
1 A 3 3
1 B 1 1
1 A 3 5
1 B 1 1
2 A 2 3
2 B 1 1
2 A 2 2
2 B 1 1
2 A 2 2
2 B 1 None
here ElapsedTime for the first row for first sequence of 'Event' A in Group 1 is calculated as:
df[df['Group']==1]['Timestamp'].iloc[2] - df[df['Group']==1]['Timestamp'].iloc[0]
and the second row for first sequence of 'Event' B in Group 1 is calculated as:
df[df['Group']==1]['Timestamp'].iloc[3] - df[df['Group']==1]['Timestamp'].iloc[2]
I have tried the rolling count, resample, groupby.cumcount() methods, and so on in pandas, and none of them are returning the results I am interested in. I am certain a complex combinations of these method in a groupby fashion should do the job (at least for first scenario), but I have spent enough time trying and searching, so far no success. It is beyond my current knowledge of pandas!
Once again your time and experience is very appreciated.