6

I have a dataframe (with columns 'a', 'b', 'c') on which I am doing a rolling-window.

I want to be able to filter the rolling window using one of the columns (say 'a') in the apply function like below

df.rolling(len(s),min_periods=0).apply(lambda x: x[[x['a']>10][0] if len(x[[x['a']>10]]) >=0 else np.nan)

The intention of above line is to select the first row in the rolling window whose 'a' column has value greater than 10. If there is no such row, then return nan.

But I am unable to do so and get the following error

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

This means that I am not allowed to access the individual columns at all by this syntax. Is there any other way of doing this kind of thing?

ishan3243
  • 1,870
  • 4
  • 30
  • 49
  • Have you checked what `x[[x['a']>10][0]` is? – Andrew L Jun 21 '17 at 11:00
  • @AndrewL Apparently I am not allowed to access the 'a' column at all. The error is saying that index can only be integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays – ishan3243 Jun 21 '17 at 11:02
  • 1
    It would be super helpful to see sample data – Andrew L Jun 21 '17 at 11:06
  • 1
    By the way you aren't selecting column 'a' at all- you're applying this to the entire dataframe. You need to change to: `df['a'].rolling(len(s),min_periods=0).apply()` – Andrew L Jun 21 '17 at 11:10

2 Answers2

7

Your error stems from assuming what comes to the function inside apply is a dataframe, it is actually a ndarray not a dataframe.

Pandas dataframe apply works on each column/series of the dataframe, so any function passed to apply is applied along each column/series like an internal lambda. In case of windowed dataframe, apply takes each column/series inside the each window and passes to the function as ndarray and the function has to return only array of length 1 per one series per one window. Knowing this saves a lot of pain.

so in your case you cannot use any apply unless you have a complex function that remembers first value of the series a for each window.

For OP's case if a column of the window say a is meeting a condition, say > 10

  1. For case where a in the first row of a window meets condition it is same as searching in dataframe df[df['a']>10].

  2. For other conditions like a in second row of a window is > 10, checking the entire dataframe works except for the first window of the dataframe.

Following example demonstrates another way to solution.

import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)), columns=list('abcd'))

df looks like

    a   b   b   d
0   13  2   2   6
1   17  19  10  1
2   0   17  15  9
3   0   14  0   15
4   19  14  4   0
5   16  4   17  3
6   2   7   2   15
7   16  7   9   3
8   6   1   2   1
9   12  8   3   10
10  5   0   11  2
11  10  13  18  4
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
15  1   11  5   18
16  17  12  18  17
17  1   19  12  9
18  16  17  3   3
19  11  7   9   2

now to select a window if second row inside rolling window of a meets a condition a > 10 like in OP's question.

roll_window=5
search_index=1

df_roll = df['a'].rolling(roll_window)
df_y = df_roll.apply(lambda x:x[1] if x[1] > 10 else np.nan).dropna()

above line returns all values of a corresponding to condition a in second row of a window greater then 10. Note the values are right based on example dataframe above but the indexes are defined by how rolling window was centered.

4     17.0
7     19.0
8     16.0
10    16.0
12    12.0
15    15.0
16    13.0
17    14.0
19    17.0

to get the right index location and entire row inside the first dataframe

df.loc[df_y.index+searchindex-rollwindow+1]

returns

    a   b   b   d
1   17  19  10  1
4   19  14  4   0
5   16  4   17  3
7   16  7   9   3
9   12  8   3   10
12  15  11  12  6
13  13  19  16  6
14  14  7   11  7
16  17  12  18  17

one could also use np.array(df) and make a rolling slice corresponding to rolling window and filter the array using slices correspondingly.

suvy
  • 693
  • 6
  • 18
  • regarding first row of rolling window, that was just an example I gave. – ishan3243 Jun 21 '17 at 14:28
  • @ishan3243 other rows also should give similar answers, except for the first windows. I did make another generic solution for you though. – suvy Jun 21 '17 at 16:54
0

First, make the rolling window:

win = df['a'].rolling(len(s), min_periods=0)

Then make your condition (boolean array):

cond = win > 10

Finally:

idx = np.where(cond)[0]
return win.iloc[idx[0]] if len(idx) else np.nan
John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • I am not sure if I understand what is happening in the code. The cond variable does not make sense to me. Can you please explain? – ishan3243 Jun 21 '17 at 14:30
  • @John Zwinck your condition is bool not a bool array, so doesnt seem to work – suvy Jun 21 '17 at 15:01