1

I have a dataframe column with 1s and 0s like this:

df['working'] = 
1
1
0
0
0
1
1
0
0
1

which represents when a machine is working (1) or stopped (0). I need to classify this stops depending on their length ie if there are less or equal than n consecutive 0s change all them to short-stop (2) if there are more than n, to long-stop (3). The expected result should look like this when applied over the example with n=2:

df[['working', 'result']]=
1    1
1    1
0    3
0    3
0    3
1    1
1    1
0    2
0    2
1    1

of course this is an example, my df has more than 1M rows.

I tried looping through it but it's really slow and also using this. But I couldn't achieve to transform it to my problem.

Can anyone help?. Thanks so much in advance.

Gamopo
  • 1,600
  • 1
  • 14
  • 22

2 Answers2

2

Here's one approach:

# Counter for each gruop where there is a change
m = df.working.ne(df.working.shift()).cumsum()
# mask where working is 0
eq0 = df.working.eq(0)
# Get a count of consecutive 0s
count = df[eq0].groupby(m[eq0]).transform('count')
# replace 0s accordingly
df.loc[eq0, 'result'] = np.where(count > 2, 3, 2).ravel()
# fill the remaining values with 1
df['result'] = df.result.fillna(1)

print(df)

    working  result
0        1     1.0
1        1     1.0
2        0     3.0
3        0     3.0
4        0     3.0
5        1     1.0
6        1     1.0
7        0     2.0
8        0     2.0
9        1     1.0
yatu
  • 86,083
  • 12
  • 84
  • 139
2

I hope Series.map with Series.value_counts should be used for improve performance:

n = 2
#compare 0 values
m = df['working'].eq(0)
#created groups only by mask
s = df['working'].cumsum()[m]
#counts only 0 groups
out = s.map(s.value_counts())
#set new values by mask
df['result'] = 1
df.loc[m, 'result'] = np.where(out > n, 3, 2)
print (df)
   working  result
0        1       1
1        1       1
2        0       3
3        0       3
4        0       3
5        1       1
6        1       1
7        0       2
8        0       2
9        1       1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @Gamopo - ya, is possibble test timings with real data? Because I think this solution is faster – jezrael Mar 17 '20 at 12:24