pandas dataframe filter by sequence of values in a specific column

Question

I have a dataframe

A B C
1 2 3
2 3 4
3 8 7

I want to keep only the rows where the sequence 3, 4 appears in consecutive rows of column C (in this example, the first two rows).

What would be the best way to do so?

@jezrael -- I don't think that's the right duplicate, OP is looking for a sequence not anywhere `in`? — Zero, Sep 05 '18 at 10:50

jezrael · Answer 1 · 2018-09-05T11:45:21.833

4

You can use rolling for a general solution that works with a pattern of any length:

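import numpy as np

pat = np.asarray([3,4])
N = len(pat)

# flag the last row of every window of N rows that equals the pattern,
# then back-fill that flag over the N-1 rows before it
mask = (df['C'].rolling(window=N, min_periods=N)
               .apply(lambda x: (x==pat).all(), raw=True)
               .mask(lambda x: x == 0)
               .bfill(limit=N-1)
               .fillna(0)
               .astype(bool))

df = df[mask]
print(df)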
   A  B  C
0  1  2  3
1  2  3  4

Explanation:

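use rolling.apply to test each window of N rows against the pattern (1 on a match, 0 otherwise)
replace the 0s with NaNs using mask
use bfill with limit=N-1 to fill the NaNs in the N-1 rows before each match with that match's 1
replace the remaining NaNs with 0 using fillna
finally cast to bool with astype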
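
The steps above can be wrapped in a small helper. This is only a sketch: the function name `sequence_mask` and the five-row sample frame are made up for illustration and are not part of the original answer, and the lambda result is cast to `float` explicitly so that `rolling.apply` always receives a plain number.

```python
import numpy as np
import pandas as pd


def sequence_mask(series, pattern):
    """Boolean mask marking every row that belongs to a run equal to `pattern`.

    Hypothetical helper built from the rolling/mask/bfill steps described above.
    """
    pat = np.asarray(pattern)
    n = len(pat)
    # 1.0 where the window ending at this row equals the pattern, 0.0 otherwise,
    # NaN for the first n-1 rows (incomplete windows).
    hits = series.rolling(window=n, min_periods=n).apply(
        lambda x: float((x == pat).all()), raw=True
    )
    return (
        hits.mask(hits == 0)   # keep only the 1.0 markers, turn the 0s into NaN
        .bfill(limit=n - 1)    # spread each marker back over the n-1 preceding rows
        .fillna(0)             # everything still NaN is not part of a match
        .astype(bool)
    )


# Illustrative data (assumed): the question's frame plus two extra rows.
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [2, 3, 8, 1, 0], "C": [3, 4, 7, 3, 5]})
mask = sequence_mask(df["C"], [3, 4])
print(df[mask])
```

Only rows 0 and 1 are kept here: the 3 in row 3 is followed by 5, so it is not the start of a match.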

edited Sep 05 '18 at 11:45

answered Sep 05 '18 at 10:54

jezrael


Why do you bfill? – oren_isp Sep 05 '18 at 11:11
@oren_isp - So an explanation is needed? Is the pattern always of length `2`? – jezrael Sep 05 '18 at 11:15

@oren_isp - OK, `bfill` is needed so that the first rows of a matched pattern are included as well. It replaces the NaNs before a match with 1; where the pattern is not matched, the NaNs are replaced by `0`. – jezrael Sep 05 '18 at 11:46
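
To make the role of `bfill` concrete, here is a small comparison on the question's column C. It is a sketch, not part of the original thread: the names `markers`, `without_bfill` and `with_bfill` are invented for this example, and the lambda result is cast to `float` explicitly.

```python
import numpy as np
import pandas as pd

c = pd.Series([3, 4, 7], name="C")   # column C from the question
pat = np.asarray([3, 4])
N = len(pat)

# 1.0 on the LAST row of each matching window, 0.0 elsewhere, NaN on row 0.
hits = c.rolling(window=N, min_periods=N).apply(
    lambda x: float((x == pat).all()), raw=True
)
markers = hits.mask(hits == 0)       # [NaN, 1.0, NaN]

# Without bfill only the row that ends the match is selected.
without_bfill = markers.fillna(0).astype(bool)
# With bfill the N-1 rows before it are selected too.
with_bfill = markers.bfill(limit=N - 1).fillna(0).astype(bool)

print(without_bfill.tolist())   # [False, True, False]
print(with_bfill.tolist())      # [True, True, False]
```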

score 2 · Accepted Answer · answered Sep 05 '18 at 10:57


Use shift

In [1085]: s = df.eq(3).any(axis=1) & df.shift(-1).eq(4).any(axis=1)

In [1086]: df[s | s.shift()]
Out[1086]:
   A  B  C
0  1  2  3
1  2  3  4
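
Note that the snippet above looks for 3 and 4 in any column. A variant restricted to column C, as the question asks, might look like the following sketch; the names `starts` and `mask` are chosen here for illustration, and `fill_value=False` is used so the shifted Series stays boolean.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})

c = df["C"]
# True on a row holding 3 whose next row holds 4 (the start of each match).
starts = c.eq(3) & c.shift(-1).eq(4)
# Also flag the row right after each start.
mask = starts | starts.shift(fill_value=False)

print(df[mask])
```

This works for a two-value pattern only; for longer patterns the rolling approach generalizes more easily.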


Zero

