
I want to group by "ts_code" and, for each row, calculate the percentage change between the maximum of one column and the minimum of another column taken at or after that maximum, over the last N rows of each group. Specifically,

df

  ts_code  high  low
0       A    20   10
1       A    30    5
2       A    40   20
3       A    50   10
4       A    20   30
5       B    50   10
6       B    30    5
7       B    40   20
8       B    10   10
9       B    20   30

Goal

Below is my expected result

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                   NA                   NA
1       A    30    5                   NA                   NA
2       A    40   20                 0.50                   NA
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                   NA                   NA
6       B    30    5                   NA                   NA
7       B    40   20                 0.90                   NA
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

lN_high_low_pct_chg (such as l3_high_low_pct_chg) = 1 - (the min of the low column at or after the peak of the high column) / (the max of the high column), computed over the last N rows, for each group and each row.

My attempt and the problem

df['l3_highest'] = df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest'] = df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg'] = 1 - df['l3_lowest'] / df['l3_highest']

But this fails: for example, at index 2 the l3_lowest would be 5, not 20, because rolling(3).min() takes the minimum over the whole window rather than only the rows at or after the peak. I don't know how to calculate the percentage after the peak.
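
To see why, here is a minimal demonstration of the issue, using the low column of group A from the data above:

import pandas as pd

low = pd.Series([10, 5, 20, 10, 30])  # group A's low column
print(low.rolling(3).min().tolist())  # [nan, nan, 5.0, 5.0, 10.0]
# At index 2 the window minimum is 5, but the peak of 'high' (40) is also
# at index 2, so only lows from index 2 onward should count, giving 20.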

For the last 4 rows of group B: at index 8, the window's max high is 50 (index 5) and the min low at or after it is 5, so l4_high_low_pct_chg = 1 - 5/50 = 0.9; at index 9, the max high is 40 (index 7) and the min low at or after it is 10, so l4_high_low_pct_chg = 1 - 10/40 = 0.75.
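
As a concrete check, here is a minimal sketch of the index-8 window computed by hand (using the group-B rows above):

import pandas as pd

# Last 4 rows of group B, ending at index 8 (original rows 5..8)
window = pd.DataFrame({'high': [50, 30, 40, 10], 'low': [10, 5, 20, 10]})

peak_idx = window['high'].idxmax()                       # 0: position of the max high
min_low_after_peak = window.loc[peak_idx:, 'low'].min()  # lows at/after the peak -> 5
print(1 - min_low_after_peak / window['high'].max())     # 0.9, the expected l4 value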

Another test dataset

  • If the rolling window is 52: for the hy_code 880912 group at index 1252, l52_high_low_pct_chg would be 0.281131, and for the 880301 group at index 1251, l52_high_low_pct_chg would be 0.321471.
Jack
  • Can you explain how the 0.6 and 0.9 values are achieved? Which rows and values lead to those results? – aaossa Mar 07 '22 at 15:36
  • @aaossa I updated the question and explained it. – Jack Mar 08 '22 at 08:25
  • For index 8, if we are looking at a rolling window of 4, then "the min value of the low column after the peak" would be 10, because the peak is 20 at index 7. The high between index 5 and 8 would be 50, so the formula would be 1-(10/50) = 0.80 Please explain what the formula is that gives you low=5. – Troy D Mar 08 '22 at 14:37
  • @TroyD What I mean by "peak" is based only on the high column. For index 8, the max is 50, so the index of the min value from the low column should be equal to or greater than 5 for the B group. And 5 is the lowest low between indexes 5 and 8. – Jack Mar 09 '22 at 00:19
  • I changed my answer to use the 'high' column for peak instead of the 'low' column. Still a disagreement on the expected result though. For index 3, you have 1-10/50, which is 0.8, and we agree on that. For index 4, you have 0.4, which I assume is 1-30/50=0.4. So for index 3, you have the high and the low on the same row, so the low is from the same row as the peak. For index 4, the low is 30, presumably because you're not allowing the low to be 10 because that's on the same row as the peak. So it seems like the logic for index 3 and index 4 are not the same. – Troy D Mar 09 '22 at 03:00
  • @TroyD Sorry, I miscalculated. Your solution is right. It's my fault that I didn't make it clear at the beginning. – Jack Mar 09 '22 at 06:37

2 Answers


Grouping by 'ts_code' is just a trivial groupby() call. The DataFrame.rolling() function works on single columns, so it's tricky to apply if you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext", as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually splits the dataframe into length-n sub-dataframes, then applies the function to calculate the value. idxmax() finds the index of the peak of the high column, then we take the min() of the low values that follow. The rest is pretty straightforward.

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 20, 10],
    ['A', 30, 5],
    ['A', 40, 20],
    ['A', 50, 10],
    ['A', 20, 30],
    ['B', 50, 10],
    ['B', 30, 5],
    ['B', 40, 20],
    ['B', 10, 10],
    ['B', 20, 30]],
    columns=['ts_code', 'high', 'low']
)
def custom_f(df, n):
    # Result series: NaN until a full window of n rows is available
    s = pd.Series(np.nan, index=df.index)

    def sub_f(df_):
        # Index of the peak of the 'high' column within the window
        high_peak_idx = df_['high'].idxmax()
        # Minimum 'low' at or after the peak
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high

    # Slide a length-n window over the group; store each result at the
    # position of the window's last row
    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)

    return s


df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values


print(df)

If you prefer to use the rolling function, this method gives the same output:

def rolling_f(rolling_df):
    # rolling() passes one column at a time; use the window's index to pull
    # the full rows (both 'high' and 'low') from the outer df
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high

# rolling().apply() runs once per numeric column; every column returns the
# same value here, so keep just the first column of the result
df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]

print(df)

Finally, if you want a true rolling-window calculation that avoids any index lookup, you can use the numpy_ext package (https://pypi.org/project/numpy-ext/):

from numpy_ext import rolling_apply

def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        # Plain numpy arrays: min 'low' at or after the argmax of 'high'
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        values = rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values)
        return pd.Series(values, index=rolling_df.index)
    except ValueError:
        # Window longer than the group: every value is undefined
        return pd.Series(np.nan, index=rolling_df.index)


df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values

print(df)

Output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:

import time

def timeit(f):
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r took: %2.4f sec' % (f.__name__, te - ts))
        return result
    return timed

Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:

df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()

Finally, we run the three tests under a timing function:

@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()

@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()

@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()

Which gives us this output:

func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec

So the fastest method is numpy_ext, which makes sense because it's optimized for vectorized calculations. The second fastest is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest by far is the Pandas rolling function.
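
For comparison, if you'd rather avoid the extra dependency, a similar vectorized calculation can be sketched in plain NumPy with numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20+). This is only a sketch: window_pct_chg is a hypothetical helper, it assumes each group has at least n rows, and it uses the same first-occurrence peak convention as argmax/idxmax:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def window_pct_chg(high, low, n):
    # Each row of hw/lw is one length-n window (requires len(high) >= n)
    hw = sliding_window_view(high, n)
    lw = sliding_window_view(low, n)
    peak = hw.argmax(axis=1)  # position of the peak within each window
    # Ignore lows that come before the peak, then take the row-wise min
    masked = np.where(np.arange(n) >= peak[:, None], lw, np.inf)
    out = 1 - masked.min(axis=1) / hw.max(axis=1)
    # Pad the first n-1 positions with NaN, like rolling()
    return np.concatenate([np.full(n - 1, np.nan), out])

df['l52_high_low_pct_chg'] = df.groupby('ts_code', group_keys=False).apply(
    lambda g: pd.Series(window_pct_chg(g['high'].to_numpy(), g['low'].to_numpy(), 52), index=g.index)
)

The mask trick trades a little extra memory (one n-wide matrix per group) for staying entirely inside NumPy, which is the same reason numpy_ext wins above.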

Troy D
  • I updated the question and explained it. – Jack Mar 08 '22 at 08:26
  • Is using a loop-based method a must? – Jack Mar 09 '22 at 06:39
  • @Jack I added a rolling method that may look a little cleaner. Inside the function, it's either doing df_=df.loc[rolling_df.index] or df_=df.iloc[i:i+n], so either way it's iterating over an index range and looking up a subsection of the dataframe. – Troy D Mar 09 '22 at 20:31
  • @Jack I'm afraid I don't understand what you mean. It will give you an error if the rolling window is greater than the size of the dataframe. I'm not sure if that's the problem you're referring to though. Can you be more specific about what you're expecting the answer to be? – Troy D Mar 12 '22 at 04:10
  • Sorry, I could only get the right answer for the latest data when the dataframe is sorted by the group column (like hy_code) and the index is re-sequenced. – Jack Mar 12 '22 at 06:53
  • But the data is large: if my data from github is repeated 1000000 times, it runs slowly (the window size is 52). – Jack Mar 12 '22 at 07:11
  • @Jack, I evaluated the time of these three methods. The numpy_ext is very fast. The rolling function is very slow. – Troy D Mar 12 '22 at 22:18
  • Running the numpy_ext method, I came across AttributeError: 'Series' object has no attribute 'stack'. – Jack Mar 13 '22 at 00:08
  • df.groupby('ts_code').apply(np_ext_f, n=3) should return a pd.DataFrame, not a pd.Series. Is there something different about the input DataFrame? It works in this google colab: https://colab.research.google.com/drive/1pZprjbPuf8Gfeqds0pfKfaFb3_HCmte3?usp=sharing – Troy D Mar 13 '22 at 00:40
  • I could run it. But if n=52, it gets ValueError: array.size should be bigger than window – Jack Mar 13 '22 at 01:33
  • would you mind trying testing https://github.com/zero-jack/data/blob/main/hy_data.csv – Jack Mar 13 '22 at 06:39
  • @Jack I figured it out. The groupby function will return a DataFrame if the outputs of all of the groups are of the same length. If not, then it returns a Series. To override the DataFrame return, specify the Series index, index=rolling_df.index. Then you don't need to stack. Answer updated. – Troy D Mar 13 '22 at 14:35
  • But if the window size is more than the group's length, it gets ValueError: array.size should be bigger than window. – Jack Mar 14 '22 at 01:40
  • If your window is larger than the group, then all the values are nan, so you can throw in a try/except ValueError and return pd.Series(np.nan, index=rolling_df.index) – Troy D Mar 14 '22 at 03:16

For my solution, we'll use .groupby("ts_code") and then .rolling to process windows of a certain size with a custom_function. This custom function takes each group and, instead of applying a function directly to the received values, uses those values to query the original dataframe. Then we can calculate the values you expect by finding the row where the "high" peak is, looking at the following rows to find the minimum "low" value, and finally calculating the result with your formula:

def custom_function(group, df):
    # Query the original dataframe using the group values
    group = df.loc[group.values]
    # Calculate your formula
    high_peak_row = group["high"].idxmax()
    min_low_after_peak = group.loc[high_peak_row:, "low"].min()
    return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]


# Reset the index to roll over that column and be able query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values

Output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

We can take this idea further and only group once:

groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
    df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values
aaossa
  • Are you using it on your real data or the sample data? Also, you said that Troy D's second method does not provide the answer you want, but here you said that the output of my method is not the same as that one. So are we both wrong? Is his first method correct? – aaossa Mar 11 '22 at 02:51
  • What's the output you expect for a rolling window of 52? – aaossa Mar 11 '22 at 02:53
  • Sorry, I could only get the right answer for the latest data when the dataframe is sorted by the group column (like hy_code) and the index is re-sequenced. – Jack Mar 12 '22 at 06:54
  • But the data is large: if my data from github is repeated 1000000 times, it runs slowly (the window size is 52). – Jack Mar 12 '22 at 07:11