Transforming financial data from postgres to pandas dataframe for use with Zipline

Question

I'm new to Pandas and Zipline, and I'm trying to learn how to use them (and use them with this data that I have). Any sorts of tips, even if no full solution, would be much appreciated. I have tried a number of things, and have gotten quite close, but run into indexing issues, Exception: Reindexing only valid with uniquely valued Index objects, in particular. [Pandas 0.10.0, Python 2.7]

I'm trying to transform monthly returns data I have for thousands of stocks in postgres from the form:

ticker_symbol :: String, monthly_return :: Float, date :: Timestamp

e.g.

AAPL, 0.112, 28/2/1992
GS, 0.13, 30/11/1981
GS, -0.23, 22/12/1981

NB: The frequency of the reporting is monthly, but there is going to be considerable NaN data here, as not all of the over 6000 companies I have here are going to be around at the same time.

…to the form described below, which is what Zipline needs to run its backtester. (I think. Can Zipline's backtester work with monthly data like this, easily? I know it can, but any tips for doing this?)

The below is a DataFrame (of timeseries? How do you say this?), in the format I need:

> data:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2268 entries, 1993-01-04 00:00:00+00:00 to 2001-12-31 00:00:00+00:00
Data columns:
AA      2268  non-null values
AAPL    2268  non-null values
GE      2268  non-null values
IBM     2268  non-null values
JNJ     2268  non-null values
KO      2268  non-null values
MSFT    2268  non-null values
PEP     2268  non-null values
SPX     2268  non-null values
XOM     2268  non-null values
dtypes: float64(10)

The below is a TimeSeries, and is in the format I need.

> data.AAPL:

Date
1993-01-04 00:00:00+00:00    73.00
1993-01-05 00:00:00+00:00    73.12
...

2001-12-28 00:00:00+00:00    36.15
2001-12-31 00:00:00+00:00    35.55
Name: AAPL, Length: 2268

Note, there isn't return data here, but prices instead. They're adjusted (by Zipline's load_from_yahoo—though, from reading the source, really by functions in pandas) for dividends, splits, etc, so there's an isomorphism (less the initial price) between that and my return data (so, no problem here).

(EDIT: Let me know if you'd like me to write what I have, or attach my iPython notebook or a gist; I just doubt it'd be helpful, but I can absolutely do it if requested.)

score 3 · Accepted Answer · answered Jan 25 '13 at 22:36

3

I suspect you are trying to set the date as the index too early. My suggestion would be to first set_index as date and company name, then you can unstack the company name and resample.

Something like this:

In [11]: df1
Out[11]: 
  ticker_symbol  monthly_return                date
0          AAPL           0.112 1992-02-28 00:00:00
1            GS           0.130 1981-11-30 00:00:00
2            GS          -0.230 1981-12-22 00:00:00

df2 = df2.set_index(['date','ticker_symbol'])
df3 = df2.unstack(level=1)
df4 = df.resample('M')

In [14]: df2
Out[14]: 
                          monthly_return
date       ticker_symbol                
1992-02-28 AAPL                    0.112
1981-11-30 GS                      0.130
1981-12-22 GS                     -0.230

In [15]: df3
Out[15]: 
               monthly_return      
ticker_symbol            AAPL    GS
date                               
1981-11-30                NaN  0.13
1981-12-22                NaN -0.23
1992-02-28              0.112   NaN

In [16]: df4
Out[16]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 124 entries, 1981-11-30 00:00:00 to 1992-02-29 00:00:00
Freq: M
Data columns:
(monthly_return, AAPL)    1  non-null values
(monthly_return, GS)      2  non-null values
dtypes: float64(2)

answered Jan 25 '13 at 22:36

Andy Hayden

359,921
101
625
535

Wow, that's superb. Thanks, Andy. One question: in `df4`, the columns are named `(monthly returns, TICKER)`; what does this mean? This is different than the names Zipline returns (just `TICKER`). Additionally, are the columns TimeSeries? (I haven't tested yet…) Finally… any tips on becoming quickly proficient with pandas? Thanks again. – Isaac Jan 25 '13 at 22:44
1

@IsaacHodes the column is a MultiIndex (which is like a double header, and useful if there were more columns in the `df2`, but not so useful here), you can "correct" this via `df2.columns = df2.columns.get_level_values(1)`. I'll have a think about advice about learning pandas (a good start could be to look through some other people's SO questions..) Best of luck :) – Andy Hayden Jan 25 '13 at 22:54
Appreciate it. Thanks again. – Isaac Jan 25 '13 at 22:57
@IsaacHodes my advice: buy [**the** book](http://shop.oreilly.com/product/0636920023784.do), it's very good. – Andy Hayden Jan 26 '13 at 02:00
I've got it, I just need to sit down and go through it deliberately! So far, I love it quite a bit. Thanks for the help! – Isaac Jan 26 '13 at 03:37
For posterity: in my actual, messier dataset, I needed to also use `groupby(levels=[0,1]).last()` to remove duplicate indices so I could `unstack(level=1)` the dataframe, and then, to get the final result, I had to call ['return'] on the dataframe: e.g. with Andy's `df4`,, `df4['return']` got me the DataFrame I needed. Now I have a DataFrame with over 6k labeled TimeSeries. Sweet! – Isaac Jan 26 '13 at 05:46

Transforming financial data from postgres to pandas dataframe for use with Zipline

1 Answers1