
I have 5 years of S&P 500 data that I’m trying to group into specific time chunks to run some analysis on. My data is in 5 minute increments. After reading it into a DataFrame called dated, I thought I could group it into chunks of consecutive rows covering the span between the Globex open and close for each trading day. The Globex open is at 3:00 PM PST (15:00) and the close is at 1:15 PM PST (13:15) the next day, 22.25 hours later. So I would like to chunk the data from the 3:00 PM open through to the 1:15 PM close--that's approximately 275 rows per chunk.

The difficulty is that a ‘trading day’ spans two different dates. For example, a session opens on Sunday 3-6-2016 at 15:00 and closes on Monday 3-7-2016 at 13:15. I’m not able to groupby the ‘Date’ column because, of course, all of 3-6 would end up in one chunk and all of 3-7 in another, when I need each chunk to span both dates in order to get the whole Globex day together.

Being relatively new to both Pandas and Python I don’t know what method to use to group the data into the chunks that I want. Once the data is segregated, I’d like to extract the high and low of each session/chunk into separate columns and also a column for just the 15:05 open price.

Here is a sample of what my data looks like:

   Date        Time   Open     High     Low      Close    Up   Down
0  2015-08-03  15:05  2073.50  2074.00  2073.25  2073.25  210  563
1  2015-08-03  15:10  2073.25  2073.25  2072.25  2072.75  118  632
2  2015-08-03  15:15  2072.75  2072.75  2072.25  2072.50  132  85
3  2015-08-03  15:20  2072.50  2072.75  2072.25  2072.50  95   312
4  2015-08-03  15:25  2072.50  2074.00  2072.50  2073.50  372  264

Originally, the ‘Date’ column values looked like this: 8/3/2015. I thought they might not be read as actual date objects, so I converted them with to_datetime() so that the ‘Date’ column holds proper dates, as seen in my sample DataFrame above.

dated['Date'] = pd.to_datetime(dated['Date'])

When I tried to change the values in the ‘Time’ column using to_datetime(), it successfully changed the time from 15:05 to 15:05:00, but it also added a date, so the value looked like this: ‘2016-03-05 15:05:00’. The problem is that it used today’s date, which won’t work for me, as my data is historical and the dates and times refer to historical prices.
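For example (an illustrative snippet, not from my data), parsing a time-only string attaches the current date:

import pandas as pd

# Parsing a bare time string fills in today's date.
pd.to_datetime('15:05')
# e.g. Timestamp('2016-03-05 15:05:00') when run on 2016-03-05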
The reason I was trying to change the ‘Time’ column to a datetime object is that I thought I would then be able to slice it into the chunks I need during a groupby operation:

dated = dated['Date'].groupby(dated['15:05' : '13:20'])  

This produced an error:

IndexError: invalid slice

So I’d appreciate any help solving this problem--pointing me in the correct areas of research. I've been essentially reading the pandas documentation piece by piece, trying out different methods, but as I'm not sure what steps to start with, I've been picking topics at random to read and not finding the answer.

Thanks, Ana


2 Answers


This is actually very complicated.

First, you can convert your time as follows:

df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)
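(A note on an assumption here: this expects Date to still be a plain string. If Date has already been converted with to_datetime(), as in the question, one way to combine the two columns would be, for example:)

# Assumes 'Date' already holds datetime values and 'Time' holds 'HH:MM' strings.
df['Datetime'] = pd.to_datetime(df['Date'].dt.strftime('%Y-%m-%d') + ' ' + df['Time'])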

Here, I will create a larger sample DataFrame:

import datetime as dt
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-1-1', '2016-1-1', freq='5min')
df = pd.DataFrame(np.random.randn(len(idx), 6),
                  columns=['Open', 'High', 'Low', 'Close', 'Up', 'Down'])
df['Datetime'] = idx

Let's add a boolean flag to indicate when the market is open.

# Create a market open flag: bars stamped after 15:00 or before 13:15
# belong to a Globex session.
df['market_open'] = False
mask = (df.Datetime.dt.time > dt.time(15)) | (df.Datetime.dt.time < dt.time(13, 15))
df.loc[mask, 'market_open'] = True

Here we create a function that returns the open, high, low, close, etc. for the grouped bars:

def ohlc(df):
    return (
        df.Datetime.iat[-1], # last timestamp in group.
        df.Open.iat[0], # First Open.
        df.High.max(), 
        df.Low.min(), 
        df.Close.iat[-1], # Last Close.
        df.Up.sum(), 
        df.Down.sum(),
        df.Close.count(), # Count number of closing bars.
        df.market_open.iat[0])   # Take first True/False indicator.

Now we do our groupby based on the change in market_open (i.e. on changes in the True/False flag) and then we apply our function to these grouped results.
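As a small illustration of how that grouping key behaves (made-up flag values, not the market data): each run of identical True/False values gets its own integer id, which is what the groupby below uses.

flag = pd.Series([False, True, True, True, False, False, True])
(flag != flag.shift()).cumsum().tolist()
# [1, 2, 2, 2, 3, 3, 4]  -- one id per run of identical values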

bars = pd.DataFrame(
    list(zip(*df.groupby(
        (df.market_open != df.market_open.shift()).cumsum()
    ).apply(ohlc)))   # list() for Python 3 compatibility
).T

bars.columns = ['bar_close_time', 'Open', 'High', 'Low', 'Close', 'Up', 'Down', 'bar_count', 'market_open']

We have bars for both open and closed sessions. We can remove the ones for when the market is closed.

# Remove bars when market is closed
bars = bars[bars.market_open].iloc[:, :-1]

>>> bars.tail()
          bar_close_time      Open     High      Low      Close        Up     Down bar_count
722  2015-12-28 13:10:00   1.23175  2.88569  -2.7143  -0.785648  -13.3166  14.6094       266
724  2015-12-29 13:10:00 -0.900675   2.6483 -2.61698    -0.8265  0.825872  4.98565       266
726  2015-12-30 13:10:00   1.65299  2.57881 -2.85199  -0.376141  -4.32867  3.62123       266
728  2015-12-31 13:10:00  0.435619  2.93638 -2.74758  -0.461525  -20.0928 -15.8205       266
730  2016-01-01 00:00:00  0.293165  2.39097  -2.1234  0.0684124  -7.83721  1.69182       108
Alexander
  • Thank you Alexander, When I got this part of the code: `dated['markt_open'] =False mask = (dated.Datetime.dt.time > dt.time(15)) | (dated.Datetime.dt.time < dt.time(13, 15)) dated.loc[mask, 'markt_open'] = True` I receive this error: `AttributeError: 'DataFrame' object has no attribute 'datetime'` BTW I’m using a Jupyter notebook with Python 2.7 import pandas as pd import numpy as np from datetime import datetime Googled for a solution, but I’m led to time series stuff in relation to the Series constructor as opposed to the DataFrame constructor…help? – AnaB29 Mar 06 '16 at 04:53
  • Import datetime as dt – Alexander Mar 06 '16 at 05:02
  • Hi Alexander and Thanks so much for helping, Still the same error: `AttributeError: 'DataFrame' object has no attribute 'datetime'` Imports adjusted: `import pandas as pd--import numpy as np--import datetime as dt`. I also fixed my typos. I looked for Datetime attribute here: [link](http://pandas.pydata.org/pandas-docs/version/0.17.0/api.html#dataframe) but not found Doc. I’m not sure what I’m missing. – AnaB29 Mar 06 '16 at 19:12
  • You notice that I am using an uppercase `D` in Datetime above, right? `df['Datetime'] = pd.to_datetime(df.Date + ' ' + df.Time)` – Alexander Mar 06 '16 at 19:21
  • Found the error. Instead of creating a large samp/DF as you did, I used my ES data file and started with `dated[‘Date Time’]` instead of `dated['Datetime']` and didn’t realize it further in the code. It works perfectly--now I just need to figure out _how_ it works. I understand all up to zip(). So I’ll pick that apart. As you see I’m learning as I go. THANKS again for your help!! Can you recommend books other than "Python for Data Analysis" by McKinney (have this one) that teaches how to create programs similar to the one I'm working on? – AnaB29 Mar 06 '16 at 21:58
  • For zip, `[(a, b) for a, b in zip([1,2,3], [4,5,6])]` yields `[(1, 4), (2, 5), (3, 6)]`. Regarding `*` inside, refer to this: http://stackoverflow.com/questions/2511300/why-does-x-y-zipzipa-b-work-in-python – Alexander Mar 06 '16 at 22:04

You could consider an auxiliary datetime column that equals the original one minus 14 hours. Then each trading session would be in just one day.

This is the same as considering another time zone for datetime, in which the trading session does not span two days.
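A minimal sketch of that idea, reusing the combined Datetime column built in the first answer (the column name session_date is just for illustration):

# Shift timestamps back 14 hours so a 15:00 -> 13:15 session falls on a single calendar date.
dated['session_date'] = (dated['Datetime'] - pd.Timedelta(hours=14)).dt.date

# Group by that date and pull out the session open, high, low and close.
session_ohlc = dated.groupby('session_date').agg(
    {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'})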

toliveira