4

I have a set of calculated OHLCVA daily securities data in a pandas dataframe like this:

>>> type(data_dy)
<class 'pandas.core.frame.DataFrame'>
>>> data_dy
              Open    High     Low   Close     Volume  Adj Close
Date                                                            
2012-12-28  140.64  141.42  139.87  140.03  148806700     134.63
2012-12-31  139.66  142.56  139.54  142.41  243935200     136.92
2013-01-02  145.11  146.15  144.73  146.06  192059000     140.43
2013-01-03  145.99  146.37  145.34  145.73  144761800     140.11
2013-01-04  145.97  146.61  145.67  146.37  116817700     140.72

[5 rows x 6 columns]

I'm using the following dictionary and the pandas resample function to convert the dataframe to monthly data:

>>> ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}

>>> data_dy.resample('M', how=ohlc_dict, closed='right', label='right')
               Volume  Adj Close    High     Low   Close    Open
Date                                                            
2012-12-31  392741900     136.92  142.56  139.54  142.41  140.64
2013-01-31  453638500     140.72  146.61  144.73  146.37  145.11

[2 rows x 6 columns]

This does the calculations correctly, but I'd like to use the Yahoo! date convention for monthly data of using the first trading day of the period rather than the last calendar day of the period that pandas uses.

So I'd like the answer set to be:

               Volume  Adj Close    High     Low   Close    Open
Date                                                            
2012-12-28  392741900     136.92  142.56  139.54  142.41  140.64
2013-01-02  453638500     140.72  146.61  144.73  146.37  145.11

I could do this by converting the daily data to a python list, process the data and return the data to a dataframe, but how do can this be done with pandas?

user2766344
  • 83
  • 1
  • 1
  • 5

3 Answers3

6

Instead of M you can pass MS as the resample rule:

df =pd.DataFrame( range(72), index = pd.date_range('1/1/2011', periods=72, freq='D'))

#df.resample('MS', how = 'mean')    # pandas <0.18
df.resample('MS').mean()  # pandas >= 0.18

Updated to use the first business day of the month respecting US Federal Holidays:

df =pd.DataFrame( range(200), index = pd.date_range('12/1/2012', periods=200, freq='D'))

from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())

df.resample(bmth_us).mean()

if you want custom starts of the month using the min month found in the data try this. (It isn't pretty, but it should work).

month_index =df.index.to_period('M')

min_day_in_month_index = pd.to_datetime(df.set_index(new_index, append=True).reset_index(level=0).groupby(level=0)['level_0'].min())

custom_month_starts =CustomBusinessMonthBegin(calendar = min_day_in_month_index)

Pass custom_start_months to the fist parameter of resample

FLab
  • 7,136
  • 5
  • 36
  • 69
JAB
  • 12,401
  • 6
  • 45
  • 50
  • Thank you, I wasn't aware of this option. However, this results in the month start date, not the oldest date of the month in the data. See here: >>> data_dy.resample('MS', how=ohlc_dict) Volume Adj Close High Low Close Open Date 2012-12-01 392741900 136.92 142.56 139.54 142.41 140.64 2013-01-01 453638500 140.72 146.61 144.73 146.37 145.11 [2 rows x 6 columns] – user2766344 Jan 24 '15 at 17:27
  • I also found this list of example resample options here: https://stackoverflow.com/questions/17001389/pandas-resample-documentation. – user2766344 Jan 24 '15 at 17:33
  • Trying the 'BMS' option yields: >>> data_dy.resample('BMS', how=ohlc_dict) Volume Adj Close High Low Close Open Date 2012-12-03 392741900 136.92 142.56 139.54 142.41 140.64 2013-01-01 453638500 140.72 146.61 144.73 146.37 145.11 Still not quite what I was looking for. – user2766344 Jan 24 '15 at 17:34
  • I am not clear on what `2012-12-28` represents? The first day of which period? – JAB Jan 24 '15 at 17:38
  • The original question seems to ask for the first trading day of the month. Is this correct? Your comment makes me think otherwise. – JAB Jan 24 '15 at 17:45
  • I'm sorry if that confused things. In my example I only listed truncated data, but in the truncated data set it means the first trading day on December 2012 (obviously not a real situation). I have calculated data for non-us markets, so the first trading day of the month varies between markets, depending on local holidays. It would be most flexible if the date returned is the date from the first data row in each month and use the dictionary rules for the other columns. – user2766344 Jan 24 '15 at 17:48
  • If you create a timeseries index of the first dates in the data by month and pass them to the`calendar` parameter in `CustomBusinessMonthBegin` this should work. – JAB Jan 24 '15 at 17:53
  • @J_Bradley Thank you for your patience and expertise. I will try this solution. – user2766344 Jan 24 '15 at 18:00
  • @user2766344 you are welcome. Update to create an timeseries index using the minimum date in the index. There is likely a more elegant way to do this, but I am not aware of it. – JAB Jan 24 '15 at 18:16
4

Thank you J Bradley, your solution worked perfectly. I did have to upgrade my version of pandas from their official website though as the version installed via pip did not have CustomBusinessMonthBegin in pandas.tseries.offsets. My final code was:

#----- imports -----
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
import pandas.io.data as web
#----- get sample data -----
df = web.get_data_yahoo('SPY', '2012-12-01', '2013-12-31')
#----- build custom calendar -----
month_index =df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['Open'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar = min_day_in_month_index)
#----- convert daily data to monthly data -----
ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
mthly_ohlcva = df.resample(custom_month_starts, how=ohlc_dict)

This yielded the following:

>>> mthly_ohlcva
                Volume  Adj Close    High     Low   Close    Open
Date                                                             
2012-12-03  2889875900     136.92  145.58  139.54  142.41  142.80
2013-01-01  2587140200     143.92  150.94  144.73  149.70  145.11
2013-02-01  2581459300     145.76  153.28  148.73  151.61  150.65
2013-03-01  2330972300     151.30  156.85  150.41  156.67  151.09
2013-04-01  2907035000     154.20  159.72  153.55  159.68  156.59
2013-05-01  2781596000     157.84  169.07  158.10  163.45  159.33
2013-06-03  3533321800     155.74  165.99  155.73  160.42  163.83
2013-07-01  2330904500     163.78  169.86  160.22  168.71  161.26
2013-08-01  2283131700     158.87  170.97  163.05  163.65  169.99
2013-09-02  2226749600     163.90  173.60  163.70  168.01  165.23
2013-10-01  2901739000     171.49  177.51  164.53  175.79  168.14
2013-11-01  1930952900     176.57  181.75  174.76  181.00  176.02
2013-12-02  2232775900     181.15  184.69  177.32  184.69  181.09
user2766344
  • 83
  • 1
  • 1
  • 5
0

I've seen in the last version of pandas you can use time offset alias 'BMS', which stands for "business month start frequency" or 'BM', which stands for "business month end frequency".

The code in the first case would look like

data_dy.resample('BMS', closed='right', label='right').apply(ohlc_dict)

or, in the second case,

data_dy.resample('BM', closed='right', label='right').apply(ohlc_dict)
Xoel
  • 318
  • 4
  • 15
  • Sounded great but even with `df.index = pd.to_datetime(df.index, format='%Y-%m-%d', errors='ignore')` there's an error: `TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'`. Maybe someone has a solution? – TravelTrader Oct 10 '19 at 10:37