Preface: I'm newish, but have searched for hours here and in the pandas documentation without success. I've also read Wes's book.
I am modeling stock market data for a hedge fund, and have a simple MultiIndexed-DataFrame with tickers, dates(daily), and fields. The sample here is from Bloomberg. 3 months - Dec. 2016 through Feb. 2017, 3 tickers(AAPL, IBM, MSFT).
import numpy as np
import pandas as pd
import os
# get data from Excel
curr_directory = os.getcwd()
filename = 'Sample Data File.xlsx'
filepath = os.path.join(curr_directory, filename)
df = pd.read_excel(filepath, sheetname = 'Sheet1', index_col = [0,1], parse_cols = 'A:D')
# sort
df.sort_index(inplace=True)
# sample of the data
df.head(15)
Out[4]:
PX_LAST PX_VOLUME
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862
2016-12-02 109.90 26527997
2016-12-05 109.11 34324540
2016-12-06 109.95 26195462
2016-12-07 111.03 29998719
2016-12-08 112.12 27068316
2016-12-09 113.95 34402627
2016-12-12 113.30 26374377
2016-12-13 115.19 43733811
2016-12-14 115.19 34031834
2016-12-15 115.82 46524544
2016-12-16 115.97 44351134
2016-12-19 116.64 27779423
2016-12-20 116.95 21424965
2016-12-21 117.06 23783165
df.tail(15)
Out[5]:
PX_LAST PX_VOLUME
Security Name date
MSFT US Equity 2017-02-07 63.43 20277226
2017-02-08 63.34 18096358
2017-02-09 64.06 22644443
2017-02-10 64.00 18170729
2017-02-13 64.72 22920101
2017-02-14 64.57 23108426
2017-02-15 64.53 17005157
2017-02-16 64.52 20546345
2017-02-17 64.62 21248818
2017-02-21 64.49 20655869
2017-02-22 64.36 19292651
2017-02-23 64.62 20273128
2017-02-24 64.62 21796800
2017-02-27 64.23 15871507
2017-02-28 63.98 23239825
When I calculate daily price changes, like this, it seems to work, only the first day is NaN, as it should be:
df.head(5)
Out[7]:
PX_LAST PX_VOLUME px_change_%
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862 NaN
2016-12-02 109.90 26527997 0.003745
2016-12-05 109.11 34324540 -0.007188
2016-12-06 109.95 26195462 0.007699
2016-12-07 111.03 29998719 0.009823
But daily 30 Day Volume doesn't. It should only be NaN for the first 29 days, but is NaN for all of it:
# daily change from 30 day volume - doesn't work
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
df.iloc[:,3:].tail(40)
Out[12]:
30_day_volume volume_change_%
Security Name date
MSFT US Equity 2016-12-30 NaN NaN
2017-01-03 NaN NaN
2017-01-04 NaN NaN
2017-01-05 NaN NaN
2017-01-06 NaN NaN
2017-01-09 NaN NaN
2017-01-10 NaN NaN
2017-01-11 NaN NaN
2017-01-12 NaN NaN
2017-01-13 NaN NaN
2017-01-17 NaN NaN
2017-01-18 NaN NaN
2017-01-19 NaN NaN
2017-01-20 NaN NaN
2017-01-23 NaN NaN
2017-01-24 NaN NaN
2017-01-25 NaN NaN
2017-01-26 NaN NaN
2017-01-27 NaN NaN
2017-01-30 NaN NaN
2017-01-31 NaN NaN
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 NaN NaN
2017-02-06 NaN NaN
2017-02-07 NaN NaN
2017-02-08 NaN NaN
2017-02-09 NaN NaN
2017-02-10 NaN NaN
2017-02-13 NaN NaN
2017-02-14 NaN NaN
2017-02-15 NaN NaN
2017-02-16 NaN NaN
2017-02-17 NaN NaN
2017-02-21 NaN NaN
2017-02-22 NaN NaN
2017-02-23 NaN NaN
2017-02-24 NaN NaN
2017-02-27 NaN NaN
2017-02-28 NaN NaN
As pandas seems to have been designed specifically for finance, I'm surprised this isn't straightforward.
Edit: I've tried some other ways as well.
- Tried converting it into a Panel (3D), but didn't find any built in functions for Windows except to convert to a DataFrame and back, so no advantage there.
- Tried to create a pivot table, but couldn't find a way to reference just the first level of the MultiIndex.
df.index.levels[0]
or...levels[1]
wasn't working.
Thanks!