
I am creating a percentile rank over a rolling window of time and would like help refining my approach.

My DataFrame has a multi-index with the first level set to datetime and the second set to an identifier. Ultimately, I’d like the rolling window to evaluate the trailing n periods, including the current period, and produce the corresponding percentile ranks.
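To be concrete about the definition: the percentile rank I'm after is what `scipy.stats.percentileofscore` computes with `kind="rank"`, i.e. the percentage of values in the pooled window at or below the score. A small illustration with made-up numbers:

```python
from scipy import stats

# Two hypothetical days of panel data pooled into one trailing window
window = [2, 93, 39, 21, 141, 43, 205, 20]

# 141 is at or above 7 of the 8 pooled values -> percentile rank 87.5
print(stats.percentileofscore(window, 141, kind="rank"))  # 87.5
```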

I referenced the posts below, but they work with the data differently than I intend to: their final functions group results by identifier and then by datetime, whereas I want my function to operate on rolling panels of data (dates and identifiers together).

using rolling functions on multi-index dataframe in pandas

Panda rolling window percentile rank

This is an example of what I am after.

Create a sample DataFrame:

import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay

num_days = 5
max_value = 301  # assumed upper bound for the random prices; not defined in the original
np.random.seed(8675309)

stock_data = {
    "AAPL": np.random.randint(1, max_value, size=num_days),
    "MSFT": np.random.randint(1, max_value, size=num_days),
    "WMT": np.random.randint(1, max_value, size=num_days),
    "TSLA": np.random.randint(1, max_value, size=num_days)
}

dates = pd.date_range(
    start="2013-01-03", 
    periods=num_days, 
    freq=BDay()
)

sample_df = pd.DataFrame(stock_data, index=dates)
sample_df = sample_df.stack().to_frame(name='data')
sample_df.index.names = ['date', 'ticker']

Which outputs:

date       ticker      
2013-01-03 AAPL       2
           MSFT      93
           TSLA      39
           WMT       21
2013-01-04 AAPL     141
           MSFT      43
           TSLA     205
           WMT       20
2013-01-07 AAPL     256
           MSFT      93
           TSLA     103
           WMT       25
2013-01-08 AAPL     233
           MSFT      60
           TSLA      13
           WMT      104
2013-01-09 AAPL      19
           MSFT     120
           TSLA     282
           WMT      293

The code below breaks sample_df into fixed 2-day increments and ranks within each bin, rather than ranking over a rolling window of time. So it's close, but not what I'm after.

sample_df.reset_index(level=1, drop=True)[['data']] \
.apply(
    lambda x: x.groupby(pd.Grouper(level=0, freq='2d')).rank()
)
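The reason this falls short is that `pd.Grouper(freq='2d')` cuts fixed, non-overlapping 2-day bins rather than a trailing window, so each date is only ranked within its bin. A small sketch with toy data makes the binning visible:

```python
import pandas as pd

dates = pd.date_range("2013-01-03", periods=5, freq="B")
s = pd.Series(range(5), index=dates)

# transform('size') reports how many dates share each fixed 2-day bin:
# 01-03/01-04 pair up, 01-07/01-08 pair up, and 01-09 is alone --
# the bins never overlap, so this is not a rolling window
sizes = s.groupby(pd.Grouper(freq="2d")).transform("size")
print(sizes.tolist())  # [2, 2, 2, 2, 1]
```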

I then tried what's shown below without much luck either.

from scipy.stats import rankdata

def rank(x):
    return rankdata(x, method='ordinal')[-1]

sample_df.reset_index(level=1, drop=True) \
.rolling(window="2d", min_periods=1) \
.apply(
    lambda x: rank(x)
)

I finally arrived at the output I'm looking for but the formula seems a bit contrived, so I'm hoping to identify a more elegant approach if one exists.

import numpy as np
import pandas as pd
from scipy import stats
from pandas.tseries.offsets import BDay

window_length = 1
target_column = "data"

def rank(df, target_column, ids, window_length):

    percentile_ranking = []
    list_of_ids = []

    date_index = df.index.get_level_values(0).unique()

    for date in date_index:
        rolling_start_date = date - BDay(window_length)
        first_date = date_index[0] + BDay(window_length)
        trailing_values = df.loc[rolling_start_date:date, target_column]

        # Only compute the rolling percentile once a full window has elapsed
        if date >= first_date:
            percentile_ranking.append(
                df.loc[date, target_column].apply(
                    lambda x: stats.percentileofscore(trailing_values, x, kind="rank")
                )
            )

            list_of_ids.append(df.loc[date, ids])

    ranks, output_ids = pd.concat(percentile_ranking), pd.concat(list_of_ids)

    df = pd.DataFrame(
        ranks.values, index=[ranks.index, output_ids], columns=["percentile_rank"]
    )

    return df

ranks = rank(
    sample_df.reset_index(level=1), 
    window_length=1, 
    ids='ticker', 
    target_column="data"
)

sample_df.join(ranks)

I get the feeling that my rank function is more than what's needed here. I appreciate any ideas/feedback to help in simplifying this code to arrive at the output below. Thank you!

                   data  percentile_rank
date       ticker                       
2013-01-03 AAPL       2              NaN
           MSFT      93              NaN
           TSLA      39              NaN
           WMT       21              NaN
2013-01-04 AAPL     141             87.5
           MSFT      43             62.5
           TSLA     205            100.0
           WMT       20             25.0
2013-01-07 AAPL     256            100.0
           MSFT      93             50.0
           TSLA     103             62.5
           WMT       25             25.0
2013-01-08 AAPL     233             87.5
           MSFT      60             37.5
           TSLA      13             12.5
           WMT      104             75.0
2013-01-09 AAPL      19             25.0
           MSFT     120             62.5
           TSLA     282             87.5
           WMT      293            100.0
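For reference, the same trailing-window logic can be written a bit more compactly. This is only a sketch (the `rolling_panel_pctile` name is made up, and it assumes the dates in the index are consecutive business days):

```python
import numpy as np
import pandas as pd
from scipy import stats
from pandas.tseries.offsets import BDay

def rolling_panel_pctile(df, col="data", window_length=1):
    dates = df.index.get_level_values("date").unique()
    out = pd.Series(np.nan, index=df.index, name="percentile_rank")
    # Skip the first window_length dates, where no full trailing window exists
    for date in dates[window_length:]:
        pool = df.loc[date - BDay(window_length):date, col]  # trailing panel values
        today = df.loc[[date], col]                          # today's cross-section
        out.loc[[date]] = [
            stats.percentileofscore(pool, v, kind="rank") for v in today
        ]
    return df.assign(percentile_rank=out)
```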

1 Answer

Edited: the original answer took fixed 2-day groups without any rolling effect, simply binning the first two days that appeared together. If you want a rolling 2-day window:

1. Pivot the DataFrame to keep the dates as the index and the tickers as columns:

pivoted = sample_df.reset_index().pivot(index='date', columns='ticker', values='data')

Output

ticker      AAPL  MSFT  TSLA  WMT
date
2013-01-03     2    93    39   21
2013-01-04   141    43   205   20
2013-01-07   256    93   103   25
2013-01-08   233    60    13  104
2013-01-09    19   120   282  293
2. Now we can apply a rolling function that considers all stocks falling in the same window:
from scipy.stats import rankdata

def pctile(s):
    wdw = sample_df.loc[s.index, :].values.flatten()  # all stock values in the window
    ranked = rankdata(wdw) / len(wdw) * 100           # their percentile ranks
    return ranked[np.where(wdw == s.iloc[-1])][0]     # percentile of the window's last value

pivoted_pctile = pivoted.rolling('2D').apply(pctile, raw=False)

Output

ticker      AAPL    MSFT    TSLA    WMT
date                
2013-01-03   25.0   100.0    75.0    50.0
2013-01-04   87.5    62.5   100.0    25.0
2013-01-07  100.0    50.0    75.0    25.0
2013-01-08   87.5    37.5    12.5    75.0
2013-01-09   25.0    62.5    87.5   100.0

To get the original format back, we just melt the results:

pd.melt(pivoted_pctile.reset_index(), 'date') \
    .sort_values(['date', 'ticker']).set_index(['date', 'ticker'])

Output

                    value
date    ticker  
2013-01-03  AAPL     25.0
            MSFT    100.0
            TSLA     75.0
            WMT      50.0
2013-01-04  AAPL     87.5
            MSFT     62.5
            TSLA    100.0
            WMT      25.0
2013-01-07  AAPL    100.0
            MSFT     50.0
            TSLA     75.0
            WMT      25.0
2013-01-08  AAPL     87.5
            MSFT     37.5
            TSLA     12.5
            WMT      75.0
2013-01-09  AAPL     25.0
            MSFT     62.5
            TSLA     87.5
            WMT     100.0

If you prefer in one execution:

pd.melt(
    sample_df\
    .reset_index()\
    .pivot('date','ticker','data')\
    .rolling('2D').apply(pctile, raw=False)\
    .reset_index(),'date')\
    .sort_values(['date', 'ticker']).set_index(['date','ticker'])

Note that on day 7 the result differs from what you displayed. This is a true rolling window: on 2013-01-07 there is no 2013-01-06 in the index, so the window holds only that day's 4 values and they are ranked against each other alone; windows never look forward.
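You can see the uneven window sizes directly. With a toy frame on the same business-day index (the values here are arbitrary), `rolling('2D').count()` reports how many observations each window holds:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-03", periods=5, freq="B")
toy = pd.DataFrame(np.arange(20).reshape(5, 4),
                   index=dates, columns=["AAPL", "MSFT", "TSLA", "WMT"])

# 2013-01-07 is a Monday: its 2-day window reaches back only to Sunday
# 01-06, which has no row, so that window holds a single day per column
counts = toy.rolling("2D").count()
print(counts["AAPL"].tolist())  # [1.0, 2.0, 1.0, 2.0, 2.0]
```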

Original

Is this something you might be looking for? I combined the groupby on the date (2 days) with transform so the number of observations is the same as the series provided. As you can see I kept the first observation of the window group.

df = sample_df.reset_index()

df['percentile_rank'] = df.groupby(pd.Grouper(key='date', freq='2D'))['data'] \
                          .transform(lambda x: x.rank(ascending=True) / len(x) * 100)

Output

Out[19]: 
         date ticker  data  percentile_rank
0  2013-01-03   AAPL     2             12.5
1  2013-01-03   MSFT    93             75.0
2  2013-01-03    WMT    39             50.0
3  2013-01-03   TSLA    21             37.5
4  2013-01-04   AAPL   141             87.5
5  2013-01-04   MSFT    43             62.5
6  2013-01-04    WMT   205            100.0
7  2013-01-04   TSLA    20             25.0
8  2013-01-07   AAPL   256            100.0
9  2013-01-07   MSFT    93             50.0
10 2013-01-07    WMT   103             62.5
11 2013-01-07   TSLA    25             25.0
12 2013-01-08   AAPL   233             87.5
13 2013-01-08   MSFT    60             37.5
14 2013-01-08    WMT    13             12.5
15 2013-01-08   TSLA   104             75.0
16 2013-01-09   AAPL    19             25.0
17 2013-01-09   MSFT   120             50.0
18 2013-01-09    WMT   282             75.0
19 2013-01-09   TSLA   293            100.0
realr
  • Thank you for the quick reply! It looks close to what I'm after except the values on the 9th are not being ranked relative to the 8th and 9th. As a result rows 17 - 19 end up being different from what I originally posted. I just ran the block of code in bold and it looks like `2013-01-09` is ranked in isolation as opposed to being ranked alongside `2013-01-08` too. **df.groupby([pd.Grouper(key='date',freq='2D')])['data']\ .transform(lambda x: x.rank())** I ran into this earlier with pd.Grouper() too. Any thoughts? – user11963761 Aug 24 '19 at 04:44
  • Thank you for the update @calestini, this is an interesting approach! The dataframe (df) I intend to use this on is quite large relative to the sample (3,700 dates and between 0 and 1,000 tickers per date), so .pivot() ends up taking a df of 35mb to over 1GB which makes manipulating it difficult (I actually can't get this to run). Given the size of my df would you say the for loop I initially posted is the best approach? – user11963761 Aug 25 '19 at 20:46
  • yeah it might be @user11963761 – realr Aug 25 '19 at 23:07