1

I have one datafrmae which includes two columns flag and flag1, i want to check if flag column values greater than 1 for 5 times or greater than 5 times continuous flag1 value should change to 1

here is example

df=pd.DataFrame({'flag':[0,0,1,1,1,1,1,1,1,0,0,0],'flag1':[0,0,0,0,0,0,1,0,0,0,0,0]})
Edward
  • 563
  • 3
  • 11
  • What happens if there is a string of 20 '1' ? Woud it be a) 00001000000000000000 or b) 00001000010000100001 ? – EBDS Oct 22 '21 at 04:28

3 Answers3

1

Idea is create consecutive counts and then test 5 if equal:

a = df['flag'].eq(1)
#https://stackoverflow.com/a/52718619/2901002
b = a.cumsum()
df['new'] = b.sub(b.mask(a).ffill().fillna(0)).eq(5).astype(int)


print (df)
    flag  flag1  new
0      0      0    0
1      0      0    0
2      1      0    0
3      1      0    0
4      1      0    0
5      1      0    0
6      1      1    1
7      1      0    0
8      1      0    0
9      0      0    0
10     0      0    0
11     0      0    0

Detail:

print (b.sub(b.mask(a).ffill().fillna(0)))
0     0.0
1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     0.0
10    0.0
11    0.0
Name: flag, dtype: float64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

setup

import pandas as pd
df=pd.DataFrame({'flag':[0,0,1,1,1,1,1,1,1,0,0,0],'flag1':[0,0,0,0,0,0,1,0,0,0,0,0]})

solution

rolling_sum = df["flag"].rolling(5).sum()
df["check"] = ((rolling_sum == 5) & (rolling_sum.diff() == 1)).astype(int)

    flag  flag1  check
0      0      0      0
1      0      0      0
2      1      0      0
3      1      0      0
4      1      0      0
5      1      0      0
6      1      1      1
7      1      0      0
8      1      0      0
9      0      0      0
10     0      0      0
11     0      0      0
Riley
  • 2,153
  • 1
  • 6
  • 16
0

2 versions of the solution, slow and fast for a len(df) = 3300000

Slow:

%%time
d = 1
for i,v in df.iterrows():
    if (v.flag == 1) and (d<5) :
        df.at[i,'flag1'] = 0
        d+=1
    elif (v.flag == 1):
        df.at[i,'flag1'] = 1
        d=1
    else:
        df.at[i,'flag1'] = 0
        d=1
df['flag2']=df['flag1'].astype(int)

Wall time: 4min 27s

Fast:

%%time
from math import floor
d = 1
df['flag1'] = (
        [(0,(d:=1))[0] if df.at[i,'flag']==0 
    else (0, (d := d+1))[0] if (d%5)!=0
    else (1, (d :=1 ))[0]
    for i in range(len(df))
    ] )

Wall time: 1min 1s

Ignore the "new" column.

flag flag1 flag2 new
0 0 0 0 0
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
5 1 0 0 0
6 1 1 1 1
7 1 0 0 0
8 1 0 0 0
9 0 0 0 0
10 0 0 0 0
11 0 0 0 0
12 1 0 0 0
13 1 0 0 0
14 1 0 0 0
15 1 0 0 0
16 1 1 1 1
17 1 0 0 0
18 1 0 0 0
19 1 0 0 0
20 1 0 0 0
21 1 1 1 0
22 1 0 0 0
23 1 0 0 0
24 1 0 0 0
25 0 0 0 0
26 0 0 0 0
27 1 0 0 0
28 0 0 0 0
29 1 0 0 0
30 1 0 0 0
31 1 0 0 0
32 1 0 0 0
33 0 0 0 0
34 0 0 0 0
35 1 0 0 0
36 1 0 0 0
37 1 0 0 0
38 1 0 0 0
39 1 1 1 1
40 1 0 0 0
41 1 0 0 0
42 0 0 0 0
43 0 0 0 0
44 0 0 0 0
45 1 0 0 0
46 1 0 0 0
47 1 0 0 0
48 1 0 0 0
49 1 1 1 1

For testing purpose, this is how I generated the data:

A = [0,0,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,1,1,1,1]
A = A * 100000
df=pd.DataFrame({'flag':A})
EBDS
  • 1,244
  • 5
  • 16
  • 1
    https://stackoverflow.com/questions/24870953/does-pandas-iterrows-have-performance-issues/24871316#24871316 – jezrael Oct 21 '21 at 06:20
  • 1
    @jezrael Thanks very much for the link. It is very useful for me. I tried different algorithms or pandas manipulation. And I also tried your method. But I found that the output are different. eg Index=21 when there is a long line of 1s. I've written to clarify with OP on the requirements on this. But for academic purpose, the different I find is the need to keep track of the "state change". I couldn't find a way to "remember" the last change when I use pandas manipulation. Once again, thanks very much. – EBDS Oct 22 '21 at 04:55