Searching through columns for a specific pattern

Question

I have a pandas column named "A" which has values like-

Now I want to search through this column for the pattern 0 1 0 and identify the row in column 'B' corresponding to the 1 in the column 'A'.

For example

Now I want it to return 3 in column 'B'. Is there any other solution other than applying nested if else?

What do you want done for the first and last elements of `A`? They don't have both of the 0 values you're looking for, but arguably they match the pattern. Are they included or excluded? — Hans Musgrave, Aug 11 '18 at 14:00
@HansMusgrave it should match the pattern 0 1 0...so in the case of the first and last since they dont we exclude them — ubuntu_noob, Aug 11 '18 at 14:02
Got it. That kind of information should probably be included in the question btw. Is there guaranteed to only be a single row in `B` matching that pattern? If not, do you want all answers? — Hans Musgrave, Aug 11 '18 at 14:04

jezrael · Accepted Answer · 2018-08-11T14:38:19.277

You can use numpy for improve performance - a bit modified solution from this:

pat = [0,1,0]
N = len(pat)
df = pd.DataFrame({'B':range(4, 14), 'A':[0,0,1,0,0,1,0,0,1,0]})
print (df)
    B  A
0   4  0
1   5  0
2   6  1
3   7  0
4   8  0
5   9  1
6  10  0
7  11  0
8  12  1
9  13  0

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['A'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)

print (rolling_window(arr, N))

[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]]

c = np.mgrid[0:len(b)][b]
#create indices of matched pattern
print (c)
[1 4 7]

#strides by column B indexed by indices of matched pattern    
d = rolling_window(df['B'].values, N)[c]
print (d)
[[ 5  6  7]
 [ 8  9 10]
 [11 12 13]]

#select second 'column'
e = d[:, 1].tolist()
print (e)
[6, 9, 12]

Can I get the index of the column B also where the number was found? — ubuntu_noob, Aug 12 '18 at 14:28
@ubuntu_noob You can replace `d = rolling_window(df['B'].values, N)[c] print (d)` to `d = rolling_window(df. index.values, N)[c] print (d)` and then `e = d[:, 1].tolist() print (e)` — jezrael, Aug 12 '18 at 15:26

score 0 · Answer 2 · answered Aug 11 '18 at 14:33

The following code starts by specifying the pattern you want to match. In your case, this was 0 1 0. You also specify which coordinate in that pattern you want to correspond to the index you're pulling from column B. You wanted the middle element which is the 1 coordinate in a 0-based indexing scheme.

From there, we're taking column A and shifting it with Series.shift(). By default, this includes NaN values for missing coordinates. The NaN won't match with 0 or 1 or any other value of interest, so we can directly compare that shifted column to whatever we're supposed to be matching and get the exact kind of True or False values that we want.

In order to match your entire pattern, we need to combine those values with a logical AND. To do that, we reduce each shifted column pairwise with s1 & s2. This returns a new column which is coordinate-wise the logical AND of the originals.

Finally, we use this boolean result which is a series with as many rows as the original DataFrame df, and we select from df['B'] using it. This returns a new series with just the values from df['B'] at the intended coordinates.

from functools import reduce

matching_values = (0, 1, 0)
matching_index = 1

df['B'][reduce(
    lambda s1, s2: s1 & s2,
    (df['A'].shift(i)==v for i, v in zip(
        xrange(-matching_index, len(matching_values)-matching_index),
        matching_values)))]

If using Python 2.x, you don't need to import reduce() first, but in Python 3.x the zip() doesn't build an intermediate list, saving on CPU and RAM resources.

Depending on what you're doing, this could easily be extracted into a function exposing the relevant parameters. Magic strings of A and B probably aren't ideal and would be appropriate choices. The matching_values and matching_index are other likely candidates.

score 0 · Answer 3 · answered Aug 11 '18 at 14:39

0

from scipy.signal import convolve
pat = [0,1,0]
df = pd.DataFrame({'B':range(4, 14), 'A':[0,0,1,0,0,1,0,0,1,0]})
s2 = convolve(df['A'],[0,1,0],mode = 'valid')
s2 = pd.Series(s2)
df.B.iloc[s2[s2==1].index + 1].values

o/p:

array([ 6,  9, 12])

o/p for your given example:

array([3])

answered Aug 11 '18 at 14:39

Krishna

6,107
2
40
43

I doubt its correctness. If anyone find any please comment. – Krishna Aug 11 '18 at 14:45
`df = pd.DataFrame({'B':range(4, 10), 'A':[0,1,0,1,1,1]})`, seems issue? – atline Aug 11 '18 at 15:51
@atline what is the issue with df = pd.DataFrame({'B':range(4, 10), 'A':[0,1,0,1,1,1]})? – ubuntu_noob Aug 12 '18 at 14:35

score 0 · Answer 4 · answered Aug 11 '18 at 15:27

Change your original data to make it suitable for more data:

import pandas as pd
o = pd.DataFrame({'A': [0, 1, 0, 1, 0, 0], 'B': [12, 14, 6, 3, 6, 8]})
b = o["A"]
m = [i+1 for (i, _) in enumerate(b) if i+2<len(b) and str(b[i])+str(b[i+1]) + str(b[i+2]) == '010']
print(o.loc[m]['B'].tolist())

So, for next input:

Will output:

[14, 3]

Searching through columns for a specific pattern

4 Answers4