Let's take two datasets:

import pandas as pd 
import numpy as np
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])

check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])

I want to do the following thing:

  1. If any of the numbers in df[0:3] is greater than check_df[0], return 1, and 0 otherwise
  2. If any of the numbers in df[1:4] is greater than check_df[1], return 1, and 0 otherwise
  3. And so on...
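
Spelled out as a plain loop, the rule above would look roughly like this. It is only a reference sketch to pin down the expected result, not the rolling-based solution I am after; the names `window`, `values`, `thresholds` and `expected` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])

window = 3  # window length taken from the df[0:3], df[1:4], ... slices above
values = df[0].tolist()
thresholds = check_df[0].tolist()

# One result per full window: window i covers values[i:i + window]
# and is compared against thresholds[i].
expected = [
    int(any(v > thresholds[i] for v in values[i:i + window]))
    for i in range(len(values) - window + 1)
]
print(expected)  # [0, 1, 0, 1, 1, 0, 1]
```

The last two positions have no full window, so they produce no result here.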

This can be done with the rolling function and a custom function:

def custom_fun(x: pd.DataFrame):
    return (x > float(check_df.iloc[0])).any()

I then combine it with the apply function:

df.rolling(3, min_periods = 3).apply(custom_fun).shift(-2)

The main problem with my solution is that I always compare with check_df[0], whereas in the i-th rolling window I should compare with check_df[i], and I have no idea how to specify that in the rolling function. Could you please give me a hand with this problem?

John
  • IIUC, this was already solved here: https://stackoverflow.com/questions/73065778/compare-two-pandas-dataframes-in-the-most-efficient-way/73066990#73066990. You can just compare `check_df[i]` with the maximum of the rolling window of `df[i:i+3]` – ko3 Jul 22 '22 at 07:10
  • Do you need processing only one column? – jezrael Jul 22 '22 at 07:15
  • Yes only one ;)) – John Jul 22 '22 at 07:26
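
The rolling-maximum idea from the first comment could be sketched as follows for the single-column case. This is a minimal sketch, not code from the linked answer: it assumes both frames share the default integer index, and `window_max` is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])

# Maximum of each 3-element window, shifted so that row i holds max(df[i:i+3]).
window_max = df[0].rolling(3, min_periods=3).max().shift(-2)

# "Any value in the window is greater than the threshold" is equivalent to
# "the window maximum is greater than the threshold".
# Rows without a full window are kept as NaN instead of being reported as 0.
res = (window_max > check_df[0]).astype(float).where(window_max.notna())
print(res)
```

Because the comparison is vectorized, no Python-level function is called per window, which usually matters on long series.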

1 Answer

IIUC, you could use the first index label of x (each window is passed to the function as a Series that keeps the original index), for example with first_valid_index:

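def custom_fun(x: pd.Series):
    # x is the current window; its first index label is the row of check_df to compare with
    # (label and position coincide here because both frames use the default integer index).
    # Selecting the scalar with iloc[row, 0] avoids calling float() on a one-element Series.
    return (x > check_df.iloc[x.first_valid_index(), 0]).any()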


res = df.rolling(3, min_periods=3).apply(custom_fun).shift(-2)

print(res)

Output

     0
0  0.0
1  1.0
2  0.0
3  1.0
4  1.0
5  0.0
6  1.0
7  NaN
8  NaN

As an alternative, use:

def custom_fun(x: pd.Series):
    # x.index[0] is the first index label of the current window
    return (x > check_df.iloc[x.index[0], 0]).any()
Dani Mesejo