Pandas dataframe - timeseries overnight returns

Question

I am creating a pandas dataframe with trading data (datetime, time, volume, price). I built the dataframe using several and identical files (each file represents a trading day) then I resample my dataframe using a 5 minutes interval.

I then calculate the return between each interval BUT I do not want to calculate the return from one day to another (i.e. the return between the last datapoint at day t and the first datapoint at day t+1).

    list_=[]
    big_df=pd.DataFrame()

    #read file into pandas
    for file in filelist:

        #create panda dataframe
        df=pd.read_hdf(file)
        #Retrieve time and price
        data= df.filter(['datetime','price'], axis=1)
        data = data.set_index('datetime')

        #Resample dataframe
        data = data.resample('5T').mean().bfill().between_time('04:00', '19:00')

        list_.append(data)

   #concatenate them together
   big_df = pd.concat(list_)

   # compute log returns
   ret_d = pd.DataFrame(100*np.log(big_df['price']).diff(1)*100)

The code above calculate the return for each interval including the return between 2 days. How can I exclude these returns? For instance, I do not want to calculate the return between day 1, 19:00 and day 2, 4:05 (please note that the first datapoint of a day could be anything after 4:05am; for instance 4:35, so we do not have the same number of datapoints each day).

My second problem is that I could not find a way to compute the weighted average mean of the price (using the volume) in my resampling (only the method .mean() is available in the pandas resample function according to the documentation). Is there any way to do this? Thank you.

Example:

In [1]: df = pd.DataFrame([[2017-01-04 18:51:00, 100,10], [2017-01-04 18:53:00, 101.5,50], [2017-01-04 18:58:00, 102.1], [2017-01-05 04:32:00, 102.6, 50], [2017-01-05 04:34:00, 102.7, 10], [2017-01-05 04:38:00, 103, 50]], columns=['datetime', 'price', 'volume'])

After puting 'datetime' as index, removing volume, and computing the weighted average price, the desired result should be the following dataframe:

pd.DataFrame([[2017-01-04 18:55:00, 101.25], [2017-01-04 19:00:00, 102], [2017-01-05 04:35:00, 102.62], [2017-01-05 04:40:00, 103]],['datetime', 'price'])

with: 101.25 = (101,5*50+100*10)/(50+10)

Finally, computing the log-return of the previous dataframe (excluding the return from a day-change), I should get:

[0.00320514*, 0**, 0.00162932***]

with: * log(102/101.25)

** 0 (since it is between 2 days)

*** log(103/102.62)

It would be fantastic if you had some sample data with the expected result. — Ted Petrou, Sep 12 '17 at 20:20
Hi Ted, I have added a picture with a step by step calculation. I am new in this website so I do not know how to submit this in a proper way. Hope it is clear enough — CTXR, Sep 12 '17 at 20:45
@CTXR, please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. — MaxU - stand with Ukraine, Sep 12 '17 at 20:59
Thanks. I will try this but I think the picture is self-explanatory. — CTXR, Sep 12 '17 at 22:53

Ken Syme · Accepted Answer · 2017-09-14T19:20:44.397

So I think I have figured out what you want - you are really asking about 2 quite separate things, the log return and the returns over days, but I think I have answered them both here. Your spreadsheet screenshot and the data in your example are inconsistent and not particularly simple to follow as mentioned in the comments, so let me know if this is the answer you expect.

Edited to incorporate comment:

import pandas as pd
import numpy as np

df = pd.DataFrame([
    ["2017-01-04 18:51:00", 100,10], 
    ["2017-01-04 18:53:00", 101.5,50], 
    ["2017-01-04 18:58:00", 102, 10], 
    ["2017-01-05 04:07:00", 101.9, 30], 
    ["2017-01-05 04:32:00", 102.6, 50], 
    ["2017-01-05 04:34:00", 102.7, 10], 
    ["2017-01-05 04:38:00", 103, 50]], columns=['datetime', 'price', 'volume'])

df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df['price_volume'] = df['price'] * df['volume']

df = df.resample("5T", label='right').agg(['sum', 'mean']).between_time('04:00', '19:00')

df['volume_weighted_price'] = df[('price_volume', 'sum')] / df[('volume', 'sum')]
df = df[['volume_weighted_price']]
df.columns = df.columns.droplevel(1)
df = df.groupby([pd.Grouper(level=0, freq='D', label='right')]).ffill()

df['log_return'] = np.log(df['volume_weighted_price']).diff(1)
print(df)

Which gives the resulting dataframe

                     volume_weighted_price  log_return
datetime                                              
2017-01-04 18:55:00             101.250000         NaN
2017-01-04 19:00:00             102.000000    0.007380
2017-01-05 04:00:00                    NaN         NaN
2017-01-05 04:05:00                    NaN         NaN
2017-01-05 04:10:00             101.900000         NaN
2017-01-05 04:15:00             101.900000    0.000000
2017-01-05 04:20:00             101.900000    0.000000
2017-01-05 04:25:00             101.900000    0.000000
2017-01-05 04:30:00             101.900000    0.000000
2017-01-05 04:35:00             102.616667    0.007008
2017-01-05 04:40:00             103.000000    0.003729

I first resample to ensure each 5 minute period exists, and sum and take the mean of all columns for calculating the volume weighted price. After calculating the price and rearranging the columns, I group by day and forward fill the prices. This gives each time period the previous periods price. Finally I calculate the returns.

Hi Ken, thanks a lot! This is exactly the idea. However, there is one thing missing (that was not in the example - sorry): my original dataframe is very heterogeneous in time: I have added the " .bfill() " in my aggregation. When I test the code you made, I end up with time freq that is not equal to 5 min (especially in the beginning of the day since there are no trades). I want to put the last price if there is no trade. Eg. when there's not trade between 5:05 am and 5:35am, I do not want a jump, I want to keep 5:05 5:10 5:15 etc. even if the price will be the last calculated (5:05 am) — CTXR, Sep 14 '17 at 00:38
@CTXR I have edited with a different solution which I think does what what you want. Please let me know how you get on an accept if it works! — Ken Syme, Sep 14 '17 at 19:21

Pandas dataframe - timeseries overnight returns

1 Answers1