0

I have two dataframes with events logged in a system, I have sorted them by the hour. I want to count consecutive hours in both df’s and then look at either the hours or the consecutive hours to decide ‘state’ on the system based on both dataframes.

Like this: if an event is logged in dataframe 1, state = 0, the state can be zero until an event is logged in the other dataframe, then the state must be = 1. I wish to see how many times the system would change state, based on the dataframes.

Small selection of dataframeA:

data = [[datetime.datetime(2021,6,7,13,0), 1], [datetime.datetime(2021,7,30,8,0), 1], [datetime.datetime(2022,4,10,1,0), 49],
    [datetime.datetime(2022,4,10,2,0), 81], [datetime.datetime(2022,4,10,3,0), 67], [datetime.datetime(2022,4,10,4,0), 93],
    [datetime.datetime(2022,5,10,3,0), 81], [datetime.datetime(2022,5,10,4,0), 1], [datetime.datetime(2022,7,10,1,0), 106]]
df_A = pd.DataFrame(data, columns=['col_0', 'sum'])

Small selection of dataframeB:

data1 = [[datetime.datetime(2020,5,1,16,0), 7], [datetime.datetime(2020,5,1,17,0), 9], [datetime.datetime(2021,6,7,20,0), 8],
    [datetime.datetime(2021,7,30,15,0), 17], [datetime.datetime(2022,6,10,3,0), 32], [datetime.datetime(2022,8,10,13,0), 54],
    [datetime.datetime(2022,8,10,14,0), 19], [datetime.datetime(2022,8,10,15,0), 1], [datetime.datetime(2022,8,10,16,0), 3]]
df_B = pd.DataFrame(data1, columns=['col_0', 'sum'])

I found this post Count of a value in consecutive timestamp in pandas, and tried:

f = (df_A.groupby('sum')['col_0'].diff().ne(pd.Timedelta(hours=1)).groupby(df_A['sum']).cumsum())
fgroups = df_A.groupby(['sum',f])['col_0']
df_A_upd = df_A.assign(count = fgroups.transform('size'),
                          Period_start = groups.transform('first'),
                          Period_end = groups.transform('last'))

But this dosen't work, and I don't see the error. The result is:

    col_0               sum count   Period_start    Period_end
0   2021-06-07 13:00:00 1   1   2021-06-07 13:00:00 2021-06-07 13:00:00
1   2021-07-30 08:00:00 1   1   2021-07-30 08:00:00 2021-07-30 08:00:00
2   2022-04-10 01:00:00 49  1   2022-04-10 01:00:00 2022-04-10 01:00:00
3   2022-04-10 02:00:00 81  1   2022-04-10 02:00:00 2022-04-10 02:00:00
4   2022-04-10 03:00:00 67  1   2022-04-10 03:00:00 2022-04-10 03:00:00
5   2022-04-10 04:00:00 93  1   2022-04-10 04:00:00 2022-04-10 04:00:00
6   2022-05-10 03:00:00 81  1   2022-04-10 05:00:00 2022-04-10 05:00:00
7   2022-05-10 04:00:00 1   1   2022-04-10 06:00:00 2022-04-10 06:00:00
8   2022-07-10 01:00:00 106 1   2022-04-10 07:00:00 2022-04-10 07:00:00

As seen, it dosen't count consecutive hours. This is the first issue.

Then, to the second part I don't really know where to start. I guess I can solve it by looping, and comparing, but it will be ugly. I think there is a better way.

I would like some code that returns something like this:

frame   state    from                change_of_state
df_B    1        2020-05-01 16:00:00 0
df_A    0        2021-06-07 13:00:00 1
df_B    1        2021-06-07 20:00:00 2
df_A    0        2021-07-30 08:00:00 3
df_B    1        2021-07-30 15:00:00 4
df_A    0        2022-04-10 01:00:00 5
df_B    1        2022-06-10 03:00:00 6
df_A    0        2022-07-10 01:00:00 7
df_B    1        2022-08-10 13:00:00 8

Is there a smart and easy way to solve this, or do I need to setup some ugly looping?

I have checked up front by using list and set to see if any events occur in the same hour in df_A and df_B, and it does not. Would however be nice to check this while creating the results, and either print or store the value somehow if it's an overlap.

Miss.Pepper
  • 103
  • 10

1 Answers1

0
# First add a column to each df with its identity and concat them together, sorted
df_A['frame']='df_A'
df_B['frame']='df_B'
df_C=pd.concat([df_A, df_B]).sort_values('col_0')
# Now, if I understand the problem correctly df_B triggers on state and df_A triggers off
df_C.loc[df_C['frame']=='df_B', 'state']=1
df_C.loc[df_C['frame']=='df_A', 'state']=0
df_C['state']=df_C['state'].astype('int') # pandas doesn't like ints to be na so need to turn into int when all values are set
df_C=df_C[df_C['frame'].shift()!=df_C['frame']].copy(deep=True) # filter out all the rows where the previous row was same state/frame
df_C['change_of_state']=[x for x in range(df_C.shape[0])]
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72