10

I need some directions in grouping a Pandas DateFrame object by year or month and get in return an new DateFrame object with a new index. Here is my code so far. groupby works as intended.

Load data from .csv file, parse 'Date' to date format (historical stock quotes from finance.yahoo.com)

In [23]: import pandas as pd
         file = pd.read_csv("sdf.de.csv", parse_dates=['Date'])
         file.head(2)

Out[23]:
    Date        Open    High    Low     Close   Volume  Adj Close
0   2016-02-16  18.650  18.70   17.940  18.16   1720800 17.0600
1   2016-02-15  18.295  18.64   18.065  18.50   1463500 17.3794

sort file for 'Date' ascending and set index to Date

In [24]: daily = file.sort_values(by='Date').set_index('Date')
         daily.head()

Out[24]:
            Open    High    Low     Close   Volume  Adj Close
Date                        
2000-01-03  14.20   14.50   14.15   14.40   277400  2.7916
2000-01-04  14.29   14.30   13.90   14.15   109200  2.7431

grouping for month

I would do an additional apply() to the groups, which would condense the data for the specific group, e.g. find the highest High value for the year/month or sum() the Volume values. This step is omitted for this example.

In [39]: monthly = daily.groupby(lambda x: (x.year, x.month))
         monthly.first()

Out[39]:
            Open    High    Low     Close   Volume  Adj Close
(2000, 1)   14.200  14.500  14.150  14.400  277400  2.7916
(2000, 2)   13.900  14.390  13.900  14.250  287200  2.7625
... ... ... ... ... ... ...
(2016, 1)   23.620  23.620  23.620  23.620  0       22.1893
(2016, 2)   19.575  19.630  19.140  19.450  1783000 18.2719

This works, but it gives me a DateFrame object with a tuple as index.

The desired result, in this case for grouping for month, would be a complete new DataFrame object, but the Date index should be a new DatetimeIndex in the form of %Y-%m or just %Y if grouped by year.

Out[39]:
        Open    High    Low     Close   Volume  Adj Close
Date
2000-01 14.200  14.500  14.150  14.400  277400  2.7916
2000-02 13.900  14.390  13.900  14.250  287200  2.7625
... ... ... ... ... ... ...
2016-01 23.620  23.620  23.620  23.620  0       22.1893
2016-02 19.575  19.630  19.140  19.450  1783000 18.2719

I'm thankful for any directions.

dirk
  • 494
  • 1
  • 3
  • 17

2 Answers2

11

You can use groupby with daily.index.year, daily.index.month or change index to_period and then groupby by index:

print daily
              Open   High    Low  Close   Volume  Adj Close
Date                                                       
2000-01-01  14.200  14.50  14.15  14.40   277400     2.7916
2000-02-01  13.900  14.39  13.90  14.25   287200     2.7625
2016-01-01  23.620  23.62  23.62  23.62        0    22.1893
2016-02-01  19.575  19.63  19.14  19.45  1783000    18.2719

print daily.groupby([daily.index.year, daily.index.month]).first()
          Open   High    Low  Close   Volume  Adj Close
2000 1  14.200  14.50  14.15  14.40   277400     2.7916
     2  13.900  14.39  13.90  14.25   287200     2.7625
2016 1  23.620  23.62  23.62  23.62        0    22.1893
     2  19.575  19.63  19.14  19.45  1783000    18.2719

daily.index = daily.index.to_period('M')
print daily.groupby(daily.index).first()
           Open   High    Low  Close   Volume  Adj Close
Date                                                    
2000-01  14.200  14.50  14.15  14.40   277400     2.7916
2000-02  13.900  14.39  13.90  14.25   287200     2.7625
2016-01  23.620  23.62  23.62  23.62        0    22.1893
2016-02  19.575  19.63  19.14  19.45  1783000    18.2719
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks, the last example works perfectly. but strangely, if i rerun this cell, I get an Attribute Error`----> 4 daily.index = daily.index.to_period('M') 5 monthly = daily.groupby(daily.index).first() 6 print (monthly) AttributeError: 'PeriodIndex' object has no attribute 'to_period'` I have to rerun **all** cells to make this work – dirk Feb 19 '16 at 08:22
  • IIUC you need set to `PeriodIndex` only once `daily.index = daily.index.to_period('M')` and then use other code. Try testing: `print daily`, `daily.index = daily.index.to_period('M')` and `print daily` – jezrael Feb 19 '16 at 08:34
  • but it changes the `PeriodIndex` of `daily` permanently. If I want to change it back to `daily.index = daily.index.to_period('D')`for days or work further with `daily.index = daily.index.to_period('A')`for year, I get the AttributeError – dirk Feb 19 '16 at 08:57
  • So then maybe is better use columns instead of index like: `daily['m'] = daily.index.to_period('M')` and `monthly = daily.groupby('m').first()`, then `daily['d'] = daily.index.to_period('D')` and `daily1 = daily.groupby('d').first()` – jezrael Feb 19 '16 at 09:02
  • Or you van remember `index` to column like `df['i'] = df.index` and then use `groupby` – jezrael Feb 19 '16 at 09:06
  • Thanks for your help. I think I try leaving the index an integer, then setting `Month`and `Year`columns. Than group with `monthly = daily.groupby(['Month'])`. – dirk Feb 19 '16 at 09:44
2

You can use a list comprehension to access the year and month accessor variable from your timestamps and then group on those.

>>> df.groupby([[d.year for d in df.Date], [d.month for d in df.Date]]).first()
             Date    Open   High    Low  Close   Volume  Adj_Close
2000 1 2000-01-01  14.200  14.50  14.15  14.40   277400     2.7916
     2 2000-02-01  13.900  14.39  13.90  14.25   287200     2.7625
2016 1 2016-01-01  23.620  23.62  23.62  23.62        0    22.1893
     2 2016-02-01  19.575  19.63  19.14  19.45  1783000    18.2719
Alexander
  • 105,104
  • 32
  • 201
  • 196