
How do I get just the 5-minute data out of this CSV using Python/pandas? For every 5-minute interval I'm trying to get the DATE, TIME, OPEN, HIGH, LOW, CLOSE, and VOLUME for that interval.

DATE       TIME     OPEN    HIGH    LOW     CLOSE   VOLUME
02/03/1997 09:04:00 3046.00 3048.50 3046.00 3047.50 505          
02/03/1997 09:05:00 3047.00 3048.00 3046.00 3047.00 162          
02/03/1997 09:06:00 3047.50 3048.00 3047.00 3047.50 98           
02/03/1997 09:07:00 3047.50 3047.50 3047.00 3047.50 228          
02/03/1997 09:08:00 3048.00 3048.00 3047.50 3048.00 136          
02/03/1997 09:09:00 3048.00 3048.00 3046.50 3046.50 174          
02/03/1997 09:10:00 3046.50 3046.50 3045.00 3045.00 134          
02/03/1997 09:11:00 3045.50 3046.00 3044.00 3045.00 43           
02/03/1997 09:12:00 3045.00 3045.50 3045.00 3045.00 214          
02/03/1997 09:13:00 3045.50 3045.50 3045.50 3045.50 8            
02/03/1997 09:14:00 3045.50 3046.00 3044.50 3044.50 152
  • Questions must show minimal research effort. What have you tried? – Luke Taylor Apr 17 '16 at 21:17
  • Pasting a link doesn't tell me you've tried. That's not a demonstration that you put effort into solving this before you asked here. Can you post some code that you tried? If you don't have any code yet, you may not want to ask here until you do. – Luke Taylor Apr 18 '16 at 00:45
  • I went through this and tried a lot of different things: http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.html. I tried asfreq and clip and a couple of other things. I thought about .at_time for every interval and somehow adding them together. But I think there is a better way. I'll keep trying. – Sam c21 Apr 18 '16 at 00:52
  • That sounds like you've made a pretty decent effort. If you edit your question to include the code you used to try these things, it would make your question much more helpful. Describe each code snippet and how it didn't work as you expected. It's good just to show that you tried some things. – Luke Taylor Apr 18 '16 at 00:55
  • I accidentally just posted the link. I've tried a few different things, but I usually delete them after I've tried them. – Sam c21 Apr 18 '16 at 00:56
  • You may also want to only include a snippet of your CSV file rather than posting the whole thing. – Luke Taylor Apr 18 '16 at 00:56
  • Ok thanks man, appreciate it – Sam c21 Apr 18 '16 at 00:57
  • I proposed a further edit to neaten up the data and cut off more of it. Only enough is needed so that potential answers can get the gist of it. You can accept the edit. – Luke Taylor Apr 18 '16 at 01:01

3 Answers


You can use df.resample to do aggregation based on a date/time variable. You'll need a datetime index, which you can specify while reading the CSV file:

df = pd.read_csv("filename.csv", parse_dates = [["DATE", "TIME"]], index_col=0)

This will result in a dataframe whose index combines date and time:

df.head()
Out[7]: 
                       OPEN    HIGH     LOW   CLOSE  VOLUME 
DATE_TIME                                                   
1997-02-03 09:04:00  3046.0  3048.5  3046.0  3047.5      505
1997-02-03 09:05:00  3047.0  3048.0  3046.0  3047.0      162
1997-02-03 09:06:00  3047.5  3048.0  3047.0  3047.5       98
1997-02-03 09:07:00  3047.5  3047.5  3047.0  3047.5      228
1997-02-03 09:08:00  3048.0  3048.0  3047.5  3048.0      136

After that you can use resample to get the sum, mean, etc. of those five minute intervals.

df.resample("5T").mean()
Out[8]: 
                       OPEN    HIGH     LOW   CLOSE  VOLUME 
DATE_TIME                                                   
1997-02-03 09:00:00  3046.0  3048.5  3046.0  3047.5    505.0
1997-02-03 09:05:00  3047.6  3047.9  3046.8  3047.3    159.6
1997-02-03 09:10:00  3045.6  3045.9  3044.8  3045.0    110.2
1997-02-03 09:15:00  3043.6  3044.0  3042.8  3043.2     69.2
1997-02-03 09:20:00  3044.7  3045.2  3044.5  3045.0     65.8
1997-02-03 09:25:00  3043.8  3044.0  3043.5  3043.7     59.0
1997-02-03 09:30:00  3044.6  3045.0  3044.3  3044.6     56.0
1997-02-03 09:35:00  3044.5  3044.5  3043.5  3044.5     44.0

(T is the alias for minute frequency; the pandas offset-alias table lists the other units. Newer pandas versions prefer the spelling "5min" over "5T".)
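Since the comments below show the real goal is 5-minute OHLCV candles rather than means, note that resample also accepts a per-column aggregation via .agg. A minimal, self-contained sketch, with made-up sample data shaped like the question's (using the newer "5min" spelling):

```python
import pandas as pd

# Made-up 1-minute bars shaped like the question's data
idx = pd.to_datetime([
    "1997-02-03 09:04:00", "1997-02-03 09:05:00", "1997-02-03 09:06:00",
    "1997-02-03 09:07:00", "1997-02-03 09:08:00", "1997-02-03 09:09:00",
])
df = pd.DataFrame({
    "OPEN":   [3046.0, 3047.0, 3047.5, 3047.5, 3048.0, 3048.0],
    "HIGH":   [3048.5, 3048.0, 3048.0, 3047.5, 3048.0, 3048.0],
    "LOW":    [3046.0, 3046.0, 3047.0, 3047.0, 3047.5, 3046.5],
    "CLOSE":  [3047.5, 3047.0, 3047.5, 3047.5, 3048.0, 3046.5],
    "VOLUME": [505, 162, 98, 228, 136, 174],
}, index=idx)

# One row per 5-minute candle: first open, highest high, lowest low,
# last close, summed volume
candles = df.resample("5min").agg({
    "OPEN": "first", "HIGH": "max", "LOW": "min",
    "CLOSE": "last", "VOLUME": "sum",
})
```

Each output row is labeled with the start of its 5-minute bucket (09:00, 09:05, ...), matching the behavior of the mean example above.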

ayhan
  • I can't get this to work. When I try ("5T") or even ("H") I get this: DatetimeIndexResampler [freq=<5 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]. But when I run just df.resample it works fine, just without the interval adjustment. – Sam c21 Apr 18 '16 at 11:30
  • Can you explain what type of adjustment are you trying to do? – ayhan Apr 18 '16 at 11:41
  • `DatetimeIndexResampler [freq=<5 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]` is the object as a result of calling resample. You can apply methods on that object (like taking the mean, standard deviation etc.). If you don't call any method it will just return the grouping. Can you add the example output you expect to the question? – ayhan Apr 18 '16 at 11:53
  • I'm just trying to get the 5-minute intervals without applying any aggregation methods like mean or standard deviation. Pretty much exactly what you did, just without the mean: only the 5-minute intervals. – Sam c21 Apr 18 '16 at 21:02
  • Probably the answer I'm looking for is the highest high for HIGH, the lowest low for LOW, the first interval's open for OPEN, the last interval's close for CLOSE, and the summed volume, all within each 5-minute interval. I'm looking it up now to see if that's possible. – Sam c21 Apr 18 '16 at 21:24
  • I think I got it. I looked up resample, thanks for teaching me that. I'm just gonna do it for each column and then combine all my columns. Really appreciate it. The hardest thing is finding all these things and learning how to use them correctly. – Sam c21 Apr 18 '16 at 21:37
  • Link to the current pandas docs, as a few parameters have changed: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html – Je Je Apr 21 '22 at 23:08

Another way in pandas is its TimeGrouper function, which is meant exactly for use cases like yours.

import pandas as pd

df = pd.DataFrame(...)  # your data provided above

# Combine the separate DATE and TIME columns into a single datetime index
df["DATE"] = pd.to_datetime(df["DATE"] + " " + df["TIME"])
df.set_index("DATE", inplace=True)

df = df.groupby(pd.TimeGrouper("5Min")).agg({
    "OPEN":   "first",
    "HIGH":   "max",
    "LOW":    "min",
    "CLOSE":  "last",
    "VOLUME": "sum"
})

The script above uses the aggregation you probably have in mind for stock data: it turns your 1-minute candles into 5-minute candles (first open, highest high, lowest low, last close, summed volume).
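As a later comment notes, TimeGrouper was eventually deprecated; the same aggregation with the newer pd.Grouper might look like this (a sketch with made-up sample data, not the asker's file):

```python
import pandas as pd

# Made-up 1-minute bars with a datetime index
idx = pd.to_datetime(["1997-02-03 09:04:00",
                      "1997-02-03 09:05:00",
                      "1997-02-03 09:06:00"])
df = pd.DataFrame({"OPEN":   [3046.0, 3047.0, 3047.5],
                   "HIGH":   [3048.5, 3048.0, 3048.0],
                   "LOW":    [3046.0, 3046.0, 3047.0],
                   "CLOSE":  [3047.5, 3047.0, 3047.5],
                   "VOLUME": [505, 162, 98]}, index=idx)

# pd.Grouper replaces the deprecated pd.TimeGrouper
candles = df.groupby(pd.Grouper(freq="5min")).agg({
    "OPEN": "first", "HIGH": "max", "LOW": "min",
    "CLOSE": "last", "VOLUME": "sum",
})
```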

Markus
  • Good solution, but it groups the 5-minute data and labels it with the beginning of the time frame. E.g. it groups 9:00, 9:01, 9:02, 9:03, 9:04 and assigns the time value 9:00, whereas the expectation is to group 9:01, 9:02, 9:03, 9:04, 9:05 and label it 9:05. – defender Jul 01 '20 at 16:55
  • @defender: You are correct in describing its behaviour :) However, for financial data this is an often-used aggregation and in general very common (this is also why the pandas devs implemented it like this), so I would not describe it as an unexpected or undesired result. But of course it is important to point out how it works so that anybody knows what he/she is getting out of it - thx for that! :) – Markus Jul 01 '20 at 23:13
  • In recent versions of pandas, use `pd.Grouper` – Abhilash Awasthi Aug 13 '21 at 14:19
  • @AbhilashAwasthi: Not entirely true, see the discussion between defender and me; the result varies. – Markus Aug 13 '21 at 16:25
  • @Markus I was just referring to the point that `pd.TimeGrouper` is deprecated in new versions of pandas and one should use `pd.Grouper`. See https://stackoverflow.com/questions/45156289/pandas-where-is-the-documentation-for-timegrouper. I was not commenting on result variation. – Abhilash Awasthi Aug 14 '21 at 04:17

A slight modification of Markus's answer: it groups the data and labels each interval with its last (right-edge) timestamp instead of the first.

df_close_left = data_set.groupby(pd.Grouper(freq='5Min', closed='right', label='right')).agg({
    "open":   "first",
    "high":   "max",
    "low":    "min",
    "close":  "last",
    "volume": "sum"
})
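To see what closed='right', label='right' changes, here is a small sketch with made-up data comparing the default left labeling against the right labeling used above:

```python
import pandas as pd

# Three made-up 1-minute rows straddling a 5-minute boundary
idx = pd.to_datetime(["1997-02-03 09:01:00",
                      "1997-02-03 09:03:00",
                      "1997-02-03 09:05:00"])
data_set = pd.DataFrame({"volume": [10, 20, 30]}, index=idx)

# Default: bins are [09:00, 09:05), labeled with the left edge
left = data_set.groupby(pd.Grouper(freq="5min")).agg({"volume": "sum"})

# Right-closed, right-labeled: bins are (09:00, 09:05], labeled 09:05
right = data_set.groupby(pd.Grouper(freq="5min", closed="right",
                                    label="right")).agg({"volume": "sum"})
```

With the default settings the 09:05 row starts a new bucket; with closed='right' it closes the previous one, so all three rows land in the single bar labeled 09:05.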
defender