I am creating a pandas dataframe with trading data (datetime, time, volume, price). I built the dataframe using several and identical files (each file represents a trading day) then I resample my dataframe using a 5 minutes interval.
I then calculate the return between each interval BUT I do not want to calculate the return from one day to another (i.e. the return between the last datapoint at day t and the first datapoint at day t+1).
list_=[]
big_df=pd.DataFrame()
#read file into pandas
for file in filelist:
#create panda dataframe
df=pd.read_hdf(file)
#Retrieve time and price
data= df.filter(['datetime','price'], axis=1)
data = data.set_index('datetime')
#Resample dataframe
data = data.resample('5T').mean().bfill().between_time('04:00', '19:00')
list_.append(data)
#concatenate them together
big_df = pd.concat(list_)
# compute log returns
ret_d = pd.DataFrame(100*np.log(big_df['price']).diff(1)*100)
The code above calculate the return for each interval including the return between 2 days. How can I exclude these returns? For instance, I do not want to calculate the return between day 1, 19:00 and day 2, 4:05 (please note that the first datapoint of a day could be anything after 4:05am; for instance 4:35, so we do not have the same number of datapoints each day).
My second problem is that I could not find a way to compute the weighted average mean of the price (using the volume) in my resampling (only the method .mean() is available in the pandas resample function according to the documentation). Is there any way to do this? Thank you.
Example:
In [1]: df = pd.DataFrame([[2017-01-04 18:51:00, 100,10], [2017-01-04 18:53:00, 101.5,50], [2017-01-04 18:58:00, 102.1], [2017-01-05 04:32:00, 102.6, 50], [2017-01-05 04:34:00, 102.7, 10], [2017-01-05 04:38:00, 103, 50]], columns=['datetime', 'price', 'volume'])
After puting 'datetime' as index, removing volume, and computing the weighted average price, the desired result should be the following dataframe:
pd.DataFrame([[2017-01-04 18:55:00, 101.25], [2017-01-04 19:00:00, 102], [2017-01-05 04:35:00, 102.62], [2017-01-05 04:40:00, 103]],['datetime', 'price'])
with: 101.25 = (101,5*50+100*10)/(50+10)
Finally, computing the log-return of the previous dataframe (excluding the return from a day-change), I should get:
[0.00320514*, 0**, 0.00162932***]
with: * log(102/101.25)
** 0 (since it is between 2 days)
*** log(103/102.62)