I have a column A in a dataframe and I want to subset rows whenever consecutive rows fall within specific ranges. For example, when the nth row is within the (4.8, 5.3) range, the (n+1)th row is within (4.8, 5.3), and the (n+2)th row is within (-10.3, -9.7).

  ColA          ColB
  13.8           A
  20.2           A
  15.3           A
  4.9            A 
  5.2            A
  -9.8           A
  20.1           A
  4.5            A
  3.2            A
  -9.8           A
  5              A
  4.8            A
  -10            A
  12.2           A

For the above input, I would like the following subset of rows in another dataframe (the 3 consecutive rows which have values in the specified range):

  ColA          ColB
  4.9            A
  5.2            A
  -9.8           A
  5              A
  4.8            A
  -10            A


I can do this with a for loop, but my dataframe has more than 70000 rows and the loop is very slow (the dataframe above is only a sample). Is there a more Pythonic way to do this? Thanks!

ASGM
  • Even if your current method isn't working, it helps to post it to make clear exactly what you're trying to do. Also, if you can turn your example into code that can be copied and pasted, it will be easier for others to answer your question. – ASGM Jul 26 '20 at 11:18
  • just as a general warning, read [how to iterate over rows in a pandas dataframe](https://stackoverflow.com/q/16476924/6692898) – RichieV Jul 26 '20 at 12:06
  • Also look at [filter with multiple criteria](https://stackoverflow.com/q/52045848/6692898) and try to combine it with [`DataFrame.shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html) – RichieV Jul 26 '20 at 12:17
  • When someone answers your question https://stackoverflow.com/help/someone-answers – RichieV Jul 31 '20 at 19:13

1 Answer

It is nice to have the sample data in a copy-pasteable form, such as the output of df.head(10).to_dict():

import pandas as pd

df = pd.DataFrame({'Col_A': [13.8, 20.2, 15.3, 4.9, 5.2, -9.8, 20.1, 4.5, 3.2, -9.8, 5, 4.8, -10, 12.2]})
df = df.assign(Col_B='A')

Instead of iterating through rows (which the documentation advises against), we can build shifted copies of the column with df.shift and filter on all the criteria at once. That flags only the nth row of each match; because we also want the two rows that follow it, we can extend the mask with a rolling window.

Here is the code:

### shift twice to evaluate all conditions in one pass
nth1 = df.Col_A.shift(-1)
nth2 = df.Col_A.shift(-2)
### parentheses let the expression span several lines; gt/lt are strict bounds
mask = (df.Col_A.gt(4.8) & df.Col_A.lt(5.3)
        & nth1.gt(4.8) & nth1.lt(5.3)
        & nth2.gt(-10.3) & nth2.lt(-9.7))

### extend the mask from row n to rows n+1 and n+2 as well
mask = mask.rolling(3).max().fillna(0).astype(bool)
df2 = df.loc[mask]

print(mask)
print(df2)

Output

0     False
1     False
2     False
3      True
4      True
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
Name: Col_A, dtype: bool
   Col_A Col_B
3    4.9     A
4    5.2     A
5   -9.8     A
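
Note that gt/lt are strict comparisons, so the 4.8 in row 11 is excluded and the second run from the question (5, 4.8, -10) is not selected above. As a minimal sketch, assuming the bounds are meant to be inclusive (the question does not say so explicitly), Series.between could replace the gt/lt pairs, and the match could be widened with shifted copies of the mask instead of a rolling window. The `ranges` list and the `start` name are illustrative and not part of the code above:

```python
import pandas as pd

df = pd.DataFrame({'Col_A': [13.8, 20.2, 15.3, 4.9, 5.2, -9.8, 20.1,
                             4.5, 3.2, -9.8, 5, 4.8, -10, 12.2]})
df = df.assign(Col_B='A')

# (low, high) bounds for rows n, n+1 and n+2.
# Assumption: both ends are inclusive, which is the default for between().
ranges = [(4.8, 5.3), (4.8, 5.3), (-10.3, -9.7)]

# start[n] is True when rows n, n+1 and n+2 each fall inside their range.
# Shifting past the end yields NaN, and between() treats NaN as False.
start = pd.Series(True, index=df.index)
for offset, (low, high) in enumerate(ranges):
    start &= df['Col_A'].shift(-offset).between(low, high)

# Spread each start flag forward so the whole run is selected.
mask = pd.Series(False, index=df.index)
for offset in range(len(ranges)):
    mask |= start.shift(offset, fill_value=False)

df2 = df.loc[mask]
print(df2)
```

With the sample data this selects index labels 3, 4, 5, 10, 11 and 12, which are the six rows listed in the question. If the bounds should be exclusive after all, the original gt/lt version is the one to keep.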
RichieV