I have two dataframes with events logged in a system, I have sorted them by the hour. I want to count consecutive hours in both df’s and then look at either the hours or the consecutive hours to decide ‘state’ on the system based on both dataframes.
Like this: if an event is logged in dataframe 1, state = 0, the state can be zero until an event is logged in the other dataframe, then the state must be = 1. I wish to see how many times the system would change state, based on the dataframes.
Small selection of dataframeA:
data = [[datetime.datetime(2021,6,7,13,0), 1], [datetime.datetime(2021,7,30,8,0), 1], [datetime.datetime(2022,4,10,1,0), 49],
[datetime.datetime(2022,4,10,2,0), 81], [datetime.datetime(2022,4,10,3,0), 67], [datetime.datetime(2022,4,10,4,0), 93],
[datetime.datetime(2022,5,10,3,0), 81], [datetime.datetime(2022,5,10,4,0), 1], [datetime.datetime(2022,7,10,1,0), 106]]
df_A = pd.DataFrame(data, columns=['col_0', 'sum'])
Small selection of dataframeB:
data1 = [[datetime.datetime(2020,5,1,16,0), 7], [datetime.datetime(2020,5,1,17,0), 9], [datetime.datetime(2021,6,7,20,0), 8],
[datetime.datetime(2021,7,30,15,0), 17], [datetime.datetime(2022,6,10,3,0), 32], [datetime.datetime(2022,8,10,13,0), 54],
[datetime.datetime(2022,8,10,14,0), 19], [datetime.datetime(2022,8,10,15,0), 1], [datetime.datetime(2022,8,10,16,0), 3]]
df_B = pd.DataFrame(data1, columns=['col_0', 'sum'])
I found this post Count of a value in consecutive timestamp in pandas, and tried:
f = (df_A.groupby('sum')['col_0'].diff().ne(pd.Timedelta(hours=1)).groupby(df_A['sum']).cumsum())
fgroups = df_A.groupby(['sum',f])['col_0']
df_A_upd = df_A.assign(count = fgroups.transform('size'),
Period_start = groups.transform('first'),
Period_end = groups.transform('last'))
But this dosen't work, and I don't see the error. The result is:
col_0 sum count Period_start Period_end
0 2021-06-07 13:00:00 1 1 2021-06-07 13:00:00 2021-06-07 13:00:00
1 2021-07-30 08:00:00 1 1 2021-07-30 08:00:00 2021-07-30 08:00:00
2 2022-04-10 01:00:00 49 1 2022-04-10 01:00:00 2022-04-10 01:00:00
3 2022-04-10 02:00:00 81 1 2022-04-10 02:00:00 2022-04-10 02:00:00
4 2022-04-10 03:00:00 67 1 2022-04-10 03:00:00 2022-04-10 03:00:00
5 2022-04-10 04:00:00 93 1 2022-04-10 04:00:00 2022-04-10 04:00:00
6 2022-05-10 03:00:00 81 1 2022-04-10 05:00:00 2022-04-10 05:00:00
7 2022-05-10 04:00:00 1 1 2022-04-10 06:00:00 2022-04-10 06:00:00
8 2022-07-10 01:00:00 106 1 2022-04-10 07:00:00 2022-04-10 07:00:00
As seen, it dosen't count consecutive hours. This is the first issue.
Then, to the second part I don't really know where to start. I guess I can solve it by looping, and comparing, but it will be ugly. I think there is a better way.
I would like some code that returns something like this:
frame state from change_of_state
df_B 1 2020-05-01 16:00:00 0
df_A 0 2021-06-07 13:00:00 1
df_B 1 2021-06-07 20:00:00 2
df_A 0 2021-07-30 08:00:00 3
df_B 1 2021-07-30 15:00:00 4
df_A 0 2022-04-10 01:00:00 5
df_B 1 2022-06-10 03:00:00 6
df_A 0 2022-07-10 01:00:00 7
df_B 1 2022-08-10 13:00:00 8
Is there a smart and easy way to solve this, or do I need to setup some ugly looping?
I have checked up front by using list and set to see if any events occur in the same hour in df_A and df_B, and it does not. Would however be nice to check this while creating the results, and either print or store the value somehow if it's an overlap.