For contextualization, in this dataframe each card validation in a bus transit system of a city is a row. In this city, there's a rule: if the next validation have been done within 90 minutes, it'll get 50% fare discount. That's what we call INTEGRACAO = True. However, it's valid just for group of two validations, in the third validation you can't get any discount.
Anyhow, if the fourth validation have also been done within 90 minutes in reference to the third, it'll get 50% fare discount too (INTEGRACAO = True), following this idea in the next validations.
Initial df
DATAHORA_UTILIZACAO CARTAO_USUARIO
314388 2020-03-04 08:06 1
322718 2020-03-04 08:12 1
324216 2020-03-04 08:13 1
431669 2020-03-04 10:22 1
457794 2020-03-04 11:03 1
483567 2020-03-04 11:37 1
534746 2020-03-04 12:31 1
586949 2020-03-04 13:31 1
602406 2020-03-04 13:53 1
604226 2020-03-04 13:55 1
691412 2020-03-04 15:50 1
698074 2020-03-04 15:58 1
699002 2020-03-04 15:59 1
771140 2020-03-04 17:02 1
838434 2020-03-04 17:43 1
956096 2020-03-04 18:59 1
Code
df_cartoes = df_cartoes.sort_values(by=['DATAHORA_UTILIZACAO', 'CARTAO_USUARIO'])
df_cartoes = df_cartoes.reset_index(drop=True)
df_cartoes.loc[:,'SEQ_VALIDACAO'] = df_cartoes.groupby('CARTAO_USUARIO').cumcount()
df_cartoes.loc[:,'INTERVALO_INTEGRACAO_SEG'] = df_cartoes.groupby('CARTAO_USUARIO')['DATAHORA_UTILIZACAO'].diff()
df_cartoes.loc[:,'INTERVALO_INTEGRACAO_SEG'] = df_cartoes['INTERVALO_INTEGRACAO_SEG'].apply(lambda x: x.total_seconds())
df_cartoes.loc[:,'INTERVALO_INTEGRACAO_MIN'] = df_cartoes['INTERVALO_INTEGRACAO_SEG'].div(60)
df_cartoes.loc[((df_cartoes['INTERVALO_INTEGRACAO_SEG'] > 300) & (df_cartoes['INTERVALO_INTEGRACAO_SEG'] <= 5400)), 'INTEGRACAO'] = True
df_cartoes.loc[:,'INTEGRACAO'] = df_cartoes['INTEGRACAO'].fillna(False)
for i in range(0,5):
df_cartoes.loc[((df_cartoes['INTEGRACAO'] == True) & ((df_cartoes.shift(1)['INTEGRACAO'] == True) & (df_cartoes.shift(1)['CARTAO_USUARIO'] == df_cartoes['CARTAO_USUARIO'])) & ((df_cartoes.shift(2)['INTEGRACAO'] == False) & (df_cartoes.shift(2)['CARTAO_USUARIO'] == df_cartoes['CARTAO_USUARIO']))), 'INTEGRACAO'] = False
Output df
DATAHORA_UTILIZACAO CARTAO_USUARIO SEQ_VALIDACAO INTERVALO_INTEGRACAO_SEG INTERVALO_INTEGRACAO_MIN INTEGRACAO
314388 2020-03-04 08:06 1 0 NaN NaN False
322718 2020-03-04 08:12 1 1 348.0 5.800000 True
324216 2020-03-04 08:13 1 2 63.0 1.050000 False
431669 2020-03-04 10:22 1 3 7784.0 129.733333 False
457794 2020-03-04 11:03 1 4 2442.0 40.700000 True
483567 2020-03-04 11:37 1 5 2048.0 34.133333 True
534746 2020-03-04 12:31 1 6 3240.0 54.000000 True
586949 2020-03-04 13:31 1 7 3591.0 59.850000 True
602406 2020-03-04 13:53 1 8 1297.0 21.616667 True
604226 2020-03-04 13:55 1 9 163.0 2.716667 False
691412 2020-03-04 15:50 1 10 6863.0 114.383333 False
698074 2020-03-04 15:58 1 11 468.0 7.800000 True
699002 2020-03-04 15:59 1 12 62.0 1.033333 False
771140 2020-03-04 17:02 1 13 3821.0 63.683333 True
838434 2020-03-04 17:43 1 14 2464.0 41.066667 True
956096 2020-03-04 18:59 1 15 4529.0 75.483333 True
Expected output df
DATAHORA_UTILIZACAO CARTAO_USUARIO SEQ_VALIDACAO INTERVALO_INTEGRACAO_SEG INTERVALO_INTEGRACAO_MIN INTEGRACAO
314388 2020-03-04 08:06 1 0 NaN NaN False
322718 2020-03-04 08:12 1 1 348.0 5.800000 True
324216 2020-03-04 08:13 1 2 63.0 1.050000 False
431669 2020-03-04 10:22 1 3 7784.0 129.733333 False
457794 2020-03-04 11:03 1 4 2442.0 40.700000 True
483567 2020-03-04 11:37 1 5 2048.0 34.133333 False
534746 2020-03-04 12:31 1 6 3240.0 54.000000 True
586949 2020-03-04 13:31 1 7 3591.0 59.850000 False
602406 2020-03-04 13:53 1 8 1297.0 21.616667 True
604226 2020-03-04 13:55 1 9 163.0 2.716667 False
691412 2020-03-04 15:50 1 10 6863.0 114.383333 False
698074 2020-03-04 15:58 1 11 468.0 7.800000 True
699002 2020-03-04 15:59 1 12 62.0 1.033333 False
771140 2020-03-04 17:02 1 13 3821.0 63.683333 True
838434 2020-03-04 17:43 1 14 2464.0 41.066667 False
956096 2020-03-04 18:59 1 15 4529.0 75.483333 True
Using a unique CARTAO_USUARIO in dataframe with this part of code I could get the expected output. However, with a lot of CARTAO_USUARIO I can't get my expected output.
for i in range(0,5):
df_cartoes.loc[((df_cartoes['INTEGRACAO'] == True) & ((df_cartoes.shift(1)['INTEGRACAO'] == True)) & ((df_cartoes.shift(2)['INTEGRACAO'] == False))), 'INTEGRACAO'] = False
Basically, the rule is that I can't get two True in sequence in the same CARTAO_USUARIO. I tried all the possibilities here in stackoverflow, but I couldn't find a solution.