3

Preface: I'm newish, but have searched for hours here and in the pandas documentation without success. I've also read Wes's book.

I am modeling stock market data for a hedge fund, and have a simple MultiIndexed-DataFrame with tickers, dates(daily), and fields. The sample here is from Bloomberg. 3 months - Dec. 2016 through Feb. 2017, 3 tickers(AAPL, IBM, MSFT).

import numpy as np
import pandas as pd
import os

# get data from Excel
curr_directory = os.getcwd()
filename = 'Sample Data File.xlsx'
filepath = os.path.join(curr_directory, filename)
df = pd.read_excel(filepath, sheetname = 'Sheet1', index_col = [0,1], parse_cols = 'A:D')

# sort
df.sort_index(inplace=True)

# sample of the data
df.head(15)
Out[4]: 
                           PX_LAST  PX_VOLUME
Security Name  date                          
AAPL US Equity 2016-12-01   109.49   37086862
               2016-12-02   109.90   26527997
               2016-12-05   109.11   34324540
               2016-12-06   109.95   26195462
               2016-12-07   111.03   29998719
               2016-12-08   112.12   27068316
               2016-12-09   113.95   34402627
               2016-12-12   113.30   26374377
               2016-12-13   115.19   43733811
               2016-12-14   115.19   34031834
               2016-12-15   115.82   46524544
               2016-12-16   115.97   44351134
               2016-12-19   116.64   27779423
               2016-12-20   116.95   21424965
               2016-12-21   117.06   23783165

df.tail(15)
Out[5]: 
                           PX_LAST  PX_VOLUME
Security Name  date                          
MSFT US Equity 2017-02-07    63.43   20277226
               2017-02-08    63.34   18096358
               2017-02-09    64.06   22644443
               2017-02-10    64.00   18170729
               2017-02-13    64.72   22920101
               2017-02-14    64.57   23108426
               2017-02-15    64.53   17005157
               2017-02-16    64.52   20546345
               2017-02-17    64.62   21248818
               2017-02-21    64.49   20655869
               2017-02-22    64.36   19292651
               2017-02-23    64.62   20273128
               2017-02-24    64.62   21796800
               2017-02-27    64.23   15871507
               2017-02-28    63.98   23239825

When I calculate daily price changes, like this, it seems to work, only the first day is NaN, as it should be:

df.head(5)
Out[7]: 
                           PX_LAST  PX_VOLUME  px_change_%
Security Name  date                                       
AAPL US Equity 2016-12-01   109.49   37086862          NaN
               2016-12-02   109.90   26527997     0.003745
               2016-12-05   109.11   34324540    -0.007188
               2016-12-06   109.95   26195462     0.007699
               2016-12-07   111.03   29998719     0.009823

But daily 30 Day Volume doesn't. It should only be NaN for the first 29 days, but is NaN for all of it:

# daily change from 30 day volume - doesn't work
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']

df.iloc[:,3:].tail(40)
Out[12]: 
                           30_day_volume  volume_change_%
Security Name  date                                      
MSFT US Equity 2016-12-30            NaN              NaN
               2017-01-03            NaN              NaN
               2017-01-04            NaN              NaN
               2017-01-05            NaN              NaN
               2017-01-06            NaN              NaN
               2017-01-09            NaN              NaN
               2017-01-10            NaN              NaN
               2017-01-11            NaN              NaN
               2017-01-12            NaN              NaN
               2017-01-13            NaN              NaN
               2017-01-17            NaN              NaN
               2017-01-18            NaN              NaN
               2017-01-19            NaN              NaN
               2017-01-20            NaN              NaN
               2017-01-23            NaN              NaN
               2017-01-24            NaN              NaN
               2017-01-25            NaN              NaN
               2017-01-26            NaN              NaN
               2017-01-27            NaN              NaN
               2017-01-30            NaN              NaN
               2017-01-31            NaN              NaN
               2017-02-01            NaN              NaN
               2017-02-02            NaN              NaN
               2017-02-03            NaN              NaN
               2017-02-06            NaN              NaN
               2017-02-07            NaN              NaN
               2017-02-08            NaN              NaN
               2017-02-09            NaN              NaN
               2017-02-10            NaN              NaN
               2017-02-13            NaN              NaN
               2017-02-14            NaN              NaN
               2017-02-15            NaN              NaN
               2017-02-16            NaN              NaN
               2017-02-17            NaN              NaN
               2017-02-21            NaN              NaN
               2017-02-22            NaN              NaN
               2017-02-23            NaN              NaN
               2017-02-24            NaN              NaN
               2017-02-27            NaN              NaN
               2017-02-28            NaN              NaN

As pandas seems to have been designed specifically for finance, I'm surprised this isn't straightforward.

Edit: I've tried some other ways as well.

  • Tried converting it into a Panel (3D), but didn't find any built in functions for Windows except to convert to a DataFrame and back, so no advantage there.
  • Tried to create a pivot table, but couldn't find a way to reference just the first level of the MultiIndex. df.index.levels[0] or ...levels[1] wasn't working.

Thanks!

Josh D
  • 794
  • 4
  • 14
  • 31

2 Answers2

3

Can you try the following to see if it works?

df['30_day_volume'] = df.groupby(level=0)['PX_VOLUME'].rolling(window=30).mean().values

df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
  • Worked, thanks! I'm curious though about the explanation behind it - why does it help to add `.values` ? – Josh D May 09 '17 at 02:15
  • Seems from [here](http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.values.html#pandas.DataFrame.values) that `values` is an attribute of the DataFrame that returns the NumPy representation of the DataFrame, and as Wes himself [says](https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-object-to-dataframe/10374456#10374456) a GroupBy object _is_ itself a DataFrame. – Josh D May 09 '17 at 02:25
  • 1
    The groupby and rolling function created a multi-index Series with duplicate index keys which caused problem during assignment to a DF column. The .values attribute takes out the values only from the Series which can be assigned to a DF column without problem. – Allen Qin May 09 '17 at 04:02
1

I can verify Allen's answer works when using pandas_datareader, modifying the index level for the groupby operation for the datareader multiindexing.

import pandas_datareader.data as web
import datetime

start = datetime.datetime(2016, 12, 1)
end = datetime.datetime(2017, 2, 28)
data = web.DataReader(['AAPL', 'IBM', 'MSFT'], 'yahoo', start, end).to_frame()

data['30_day_volume'] = data.groupby(level=1).rolling(window=30)['Volume'].mean().values

data['volume_change_%'] = (data['Volume'] - data['30_day_volume']) / data['30_day_volume']

# double-check that it computed starting at 30 trading days. 
data.loc['2017-1-17':'2017-1-30']

The original poster might try editing this line:

df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()

to the following, using mean().values:

df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean().values

The data don't get properly aligned without this, resulting in NaN's.