I have a dataframe like below:

Text  Label 
 a     NaN
 b     NaN
 c     NaN
 1     NaN
 2     NaN
 b     NaN
 c     NaN 
 a     NaN
 b     NaN
 c     NaN
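
For reproducibility, a frame like the one above could be built as follows. This is only a sketch: it assumes every Text value is stored as a string (including '1' and '2') and that Label starts out as an object column, so that strings can be written into it later.

```python
import numpy as np
import pandas as pd

# Assumption: all Text values are strings, including '1' and '2'.
df = pd.DataFrame({"Text": list("abc12bcabc")})

# Assumption: Label is an object column so it can hold both NaN and strings.
df["Label"] = pd.Series(np.nan, index=df.index, dtype=object)

print(df)
```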

Whenever the sequence a, b, c appears in consecutive rows (top to bottom), I want to set Label to a string such as 'Check' for those rows. The final dataframe should look like this:

Text  Label 
 a     Check
 b     Check
 c     Check
 1     NaN
 2     NaN
 b     NaN
 c     NaN 
 a     Check
 b     Check
 c     Check

What is the best way to do this? Thank you =)

s900n

3 Answers

Here's a NumPy based approach leveraging broadcasting:

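import numpy as np

# True on each row where the running concatenation of Text ends in 'abc'
w = df.Text.cumsum().str[-3:].eq('abc') # inefficient for large dfs
# Broadcast each match-end index against offsets [-2, -1, 0] (assumes the default integer index)
m = (w[w].index.values[:,None] + np.arange(-2,1)).ravel()
df.loc[m, 'Label'] = 'Check'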

   Text  Label
0    a  Check
1    b  Check
2    c  Check
3    1    NaN
4    2    NaN
5    b    NaN
6    c    NaN
7    a  Check
8    b  Check
9    c  Check
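
To make the broadcasting step easier to follow, here is a small standalone sketch with the two match-end positions from the example (rows 2 and 9) hard-coded. The names `ends`, `offsets`, `grid` and `rows` are illustrative only and do not appear in the code above.

```python
import numpy as np

# Rows where a window ending at that row reads 'abc' (2 and 9 in the example).
ends = np.array([2, 9])

# Offsets that reach back over the three rows of each match.
offsets = np.arange(-2, 1)

# A (2, 1) column plus a (3,) row broadcasts to a (2, 3) grid of row positions.
grid = ends[:, None] + offsets

# Flatten to a 1-D list of row positions usable with df.loc.
rows = grid.ravel()

print(grid)
print(rows)
```

Each row of `grid` holds the three positions of one match, so flattening it gives every row that should receive the label.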
yatu

For a general solution, use a rolling window together with numpy.where:

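import numpy as np

arr = df['Text'].to_numpy()  # as_strided needs an ndarray; a Series has no .strides
pat = list('abc')
N = len(pat)

def rolling_window(a, window):
    # Strided view of shape (len(a) - window + 1, window) over a; no data is copied
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# True at each start position whose window equals the pattern
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

# Every row position covered by a match
d = [i  for x in c for i in range(x, x+N)]
# np.where builds a string array here, so np.nan comes out as the string 'nan'
df['label'] = np.where(np.isin(np.arange(len(arr)), d), 'Check', np.nan)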
print (df)
  Text  Label  label
0    a    NaN  Check
1    b    NaN  Check
2    c    NaN  Check
3    1    NaN    nan
4    2    NaN    nan
5    b    NaN    nan
6    c    NaN    nan
7    a    NaN  Check
8    b    NaN  Check
9    c    NaN  Check
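
As a variation on the same idea, the windows can also be built with `numpy.lib.stride_tricks.sliding_window_view` (available since NumPy 1.20), which computes the shape and strides itself. This is a minimal sketch rather than a drop-in replacement: it assumes all Text values are strings, rebuilds the example frame so it runs on its own, and writes real missing values into Label instead of the string 'nan'. The names `windows`, `starts` and `mask` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Assumption: the example frame, with every Text value stored as a string.
df = pd.DataFrame({"Text": list("abc12bcabc")})

pat = np.array(list("abc"))
N = len(pat)

# Fixed-width string array built from the column values.
arr = np.asarray(df["Text"].tolist())

# One row per window: shape (len(arr) - N + 1, N).
windows = sliding_window_view(arr, N)

# Start positions of the windows that equal the pattern.
starts = np.flatnonzero((windows == pat).all(axis=1))

# Mark every row covered by a matching window.
mask = np.zeros(len(arr), dtype=bool)
for offset in range(N):
    mask[starts + offset] = True

# Object column so that unmatched rows keep a real NaN.
df["Label"] = pd.Series(np.nan, index=df.index, dtype=object)
df.loc[mask, "Label"] = "Check"

print(df)
```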
jezrael

The good old shift and bfill combination works as well (for a small number of steps):

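# Mark each 'c' that is directly preceded by 'a' and then 'b'
s = df.Text.eq('c') & df.Text.shift().eq('b') & df.Text.shift(2).eq('a')
df.loc[s, 'Label'] = 'Check'
# Back-fill the two rows above each mark; assign the result back instead of using inplace on df.Label
df['Label'] = df['Label'].bfill(limit=2)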

Output:

  Text  Label
0    a  Check
1    b  Check
2    c  Check
3    1    NaN
4    2    NaN
5    b    NaN
6    c    NaN
7    a  Check
8    b  Check
9    c  Check
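
The same shift idea can be extended to a pattern of any length by looping over the pattern instead of writing each shift by hand. The helper below is a hypothetical sketch (`label_pattern` is not a pandas function), and it rebuilds the example frame under the assumption that all Text values are strings.

```python
import numpy as np
import pandas as pd

def label_pattern(text, pattern, label="Check"):
    """Return an object Series with `label` on every row of a full match."""
    n = len(pattern)

    # Row i ends a match when row i - k equals the k-th item counted from the end.
    ends = pd.Series(True, index=text.index)
    for k, item in enumerate(reversed(pattern)):
        ends &= text.shift(k).eq(item)

    # Spread each end marker back over the n - 1 rows above it.
    hit = ends.copy()
    for k in range(1, n):
        hit |= ends.shift(-k, fill_value=False)

    out = pd.Series(np.nan, index=text.index, dtype=object)
    out[hit] = label
    return out

# Assumption: the example frame, with every Text value stored as a string.
df = pd.DataFrame({"Text": list("abc12bcabc")})
df["Label"] = label_pattern(df["Text"], list("abc"))

print(df)
```

The cost grows with the pattern length, since each pattern item adds one shifted comparison, so this stays in the "small number of steps" territory mentioned above.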
Quang Hoang