
I have the following data

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = pd.Timestamp(1513393355.5, unit='s')
days = pd.date_range(date_today, date_today + timedelta(1), freq='s')

np.random.seed(seed=1111)
data_price = np.random.randint(2, high=10, size=len(days))
data_quantity = np.random.randint(2, high=100, size=len(days))

df = pd.DataFrame({'ts': days, 'price': data_price, 'quantity': data_quantity})
df = df.set_index('ts')
print(df.head())

                         price  quantity
ts                                      
2017-12-16 03:02:35.500      6        30
2017-12-16 03:02:36.500      9        18
2017-12-16 03:02:37.500      7        85
2017-12-16 03:02:38.500      3        51
2017-12-16 03:02:39.500      6        19

I would like to resample the data into 10-minute intervals, rank the observations within each 10-minute window by increasing price, keep the top 20% of the observations, and then calculate the weighted average price (i.e. price weighted by quantity) and the sum of the quantity over that top 20%.

There is a solution here which uses the groupby function to calculate the weighted average price, but I would like to apply the weighted average and the sum only to the top 20% of the data.

I would like to do this both on a static basis (i.e. with the pandas resample function) and on a rolling basis, evaluated every 1 minute with a 10-minute lookback period.

How can I do this elegantly with pandas? I am confused about how to do the ranking within the resample window.

Thank you!

finstats
    Are you looking for a rolling 10min interval, or a fixed one, just splitting the data into 10-min pieces? – realr Aug 22 '19 at 00:17
  • @calestini actually solutions for both would be appreciated. I will update the question – finstats Aug 22 '19 at 00:27

1 Answer


Here is one attempt. I used a rolling window of 10 minutes, so each value reflects the previous 10 minutes of data. For the sake of demonstration I changed the window to 10 seconds so it is easier to verify the calculation.

The logic is:

  • Filter the top 20% highest prices in the window
  • Calculate the weighted average of the filtered data (sum of qty_pct * price)
  • Note: with 1-4 observations only the highest value is used; with 5-9, still just the highest (1 observation is at least 20%); with 10-14, the top 2 (2 observations are at least 20%), and so on.

Edit: I realized I was calculating the top quantile, not the top 20% of the observations. I kept the original below; here is the corrected version:

def top_obs_wavg(s):
    ## the top 20% of observations (by count) are the valid ones
    if len(s) < 5:  # not enough observations for 20%, keep the largest
        valid_index = s.nlargest(1).index
    else:
        valid_index = s.nlargest(len(s) // 5).index  # keep the top 20%

    ## quantity weights within the window, restricted to the top-price rows
    pct_qty = df.loc[valid_index, 'quantity'] / np.sum(df.loc[valid_index, 'quantity'])

    ## sum of weights * valid prices = weighted average
    return np.sum(pct_qty * s[valid_index])

df['t20_wavg'] = df.rolling('10s')['price'].apply(top_obs_wavg, raw=False)

Output

                       price    quantity    t20_wavg
ts          
2017-12-16 03:02:35.500     6   30          6.000000
2017-12-16 03:02:36.500     9   18          9.000000
2017-12-16 03:02:37.500     7   85          9.000000
2017-12-16 03:02:38.500     3   51          9.000000
2017-12-16 03:02:39.500     6   19          9.000000
2017-12-16 03:02:40.500     4   72          9.000000
2017-12-16 03:02:41.500     6   47          9.000000
2017-12-16 03:02:42.500     2   64          9.000000
2017-12-16 03:02:43.500     8   21          9.000000
2017-12-16 03:02:44.500     6   46          8.461538
2017-12-16 03:02:45.500     5   40          8.461538
2017-12-16 03:02:46.500     8   13          8.000000
2017-12-16 03:02:47.500     2   99          8.000000
2017-12-16 03:02:48.500     8   19          8.000000
2017-12-16 03:02:49.500     6   60          8.000000
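The static version the question asks about works the same way, applied per fixed 10-minute bin instead of per rolling window. A minimal sketch, assuming the same top-20%-by-count rule and grouping via pd.Grouper (the names top20_stats, wavg_price, and quantity_sum are my own; this also returns the quantity sum the question asked for):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question.
date_today = pd.Timestamp(1513393355.5, unit='s')
days = pd.date_range(date_today, date_today + pd.Timedelta(days=1), freq='s')
np.random.seed(seed=1111)
df = pd.DataFrame({
    'ts': days,
    'price': np.random.randint(2, high=10, size=len(days)),
    'quantity': np.random.randint(2, high=100, size=len(days)),
}).set_index('ts')

def top20_stats(g):
    # Keep the top 20% of rows by price (at least one row).
    n = max(len(g) // 5, 1)
    top = g.nlargest(n, 'price')
    return pd.Series({
        'wavg_price': np.average(top['price'], weights=top['quantity']),
        'quantity_sum': top['quantity'].sum(),
    })

# One row per fixed 10-minute bin.
static = df.groupby(pd.Grouper(freq='10min')).apply(top20_stats)
```

Grouping by pd.Grouper(freq='10min') gives non-overlapping bins labeled at their left edge, which matches what resample would produce; returning a Series from the function yields one output column per statistic.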

Using Quantiles

def top_quantile_wavg(s):
    ## keep observations at or above the median price
    ## (s.quantile() defaults to the 0.5 quantile; use s.quantile(0.8)
    ## to keep only prices in the top 20% by value)
    is_valid = s >= s.quantile()
    valid_index = s.index[is_valid]

    ## quantity weights within the window, restricted to the selected rows
    pct_qty = df.loc[valid_index, 'quantity'] / np.sum(df.loc[valid_index, 'quantity'])

    ## sum of weights * valid prices = weighted average
    return np.sum(pct_qty * s[valid_index])

Then we can apply it with the pandas rolling API:

## change '10s' to '10min' for a 10-minute window
df['t20_wavg'] = df.rolling('10s')['price'].apply(top_quantile_wavg, raw=False)

Output

                          price     quantity    t20_wavg
ts          
2017-12-16 03:02:35.500     6       30          6.000000
2017-12-16 03:02:36.500     9       18          9.000000
2017-12-16 03:02:37.500     7       85          7.349515
2017-12-16 03:02:38.500     3       51          7.349515
2017-12-16 03:02:39.500     6       19          6.914474
2017-12-16 03:02:40.500     4       72          6.914474
2017-12-16 03:02:41.500     6       47          6.698492
2017-12-16 03:02:42.500     2       64          6.698492
2017-12-16 03:02:43.500     8       21          6.822727
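The question also asked for the rolling sum of quantity over the top 20%; the same index-selection trick works for that. A sketch under the same assumptions (the name top20_qty_sum is mine; it runs on the first 60 rows only, since a Python-level rolling apply over the full day is slow):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question.
date_today = pd.Timestamp(1513393355.5, unit='s')
days = pd.date_range(date_today, date_today + pd.Timedelta(days=1), freq='s')
np.random.seed(seed=1111)
df = pd.DataFrame({
    'ts': days,
    'price': np.random.randint(2, high=10, size=len(days)),
    'quantity': np.random.randint(2, high=100, size=len(days)),
}).set_index('ts')

sub = df.head(60).copy()

def top20_qty_sum(s):
    # s is the window's price series; pick the top 20% by count
    # (at least one row) and sum the matching quantities.
    n = max(len(s) // 5, 1)
    valid_index = s.nlargest(n).index
    return sub.loc[valid_index, 'quantity'].sum()

# 10-second window for demonstration; use '10min' for the real lookback.
sub['t20_qty'] = sub.rolling('10s')['price'].apply(top20_qty_sum, raw=False)
```

The first window contains a single observation, so its quantity sum is just that row's quantity; as the window fills, the sum covers the quantities of the top-priced rows only.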
realr