162

Consider a CSV file:

string,date,number
a string,2/5/11 9:16am,1.0
a string,3/5/11 10:44pm,2.0
a string,4/22/11 12:07pm,3.0
a string,4/22/11 12:10pm,4.0
a string,4/29/11 11:59am,1.0
a string,5/2/11 1:41pm,2.0
a string,5/2/11 2:02pm,3.0
a string,5/2/11 2:56pm,4.0
a string,5/2/11 3:00pm,5.0
a string,5/2/14 3:02pm,6.0
a string,5/2/14 3:18pm,7.0

I can read this in, and reformat the date column into datetime format:

b = pd.read_csv('b.dat')
b['date'] = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')

I have been trying to group the data by month. It seems like there should be an obvious way of accessing the month and grouping by that. But I can't seem to do it. Does anyone know how?

What I am currently trying is re-indexing by the date:

b.index = b['date']

I can access the month like so:

b.index.month

However I can't seem to find a function to lump together by month.

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
Lee
  • 29,398
  • 28
  • 117
  • 170
  • 5
    If you are struggling with applying any of the answer, please keep in mind that in this question (and therefore in the answers) the Datetime value is assigned to the index of the Dataframe. A quick tip/reminder could be the following: if you have a Datetime column, you can actually access the single Yeay/Month/Day/Hour/Minute value just by doing `my_df.my_column.dt.month` – Federico Dorato Nov 30 '20 at 13:03

5 Answers5

256

Managed to do it:

b = pd.read_csv('b.dat')
b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')
b.groupby(by=[b.index.month, b.index.year])

Or

b.groupby(pd.Grouper(freq='M'))  # update for v0.21+
goodside
  • 4,429
  • 2
  • 22
  • 32
Lee
  • 29,398
  • 28
  • 117
  • 170
  • 63
    I think the more pandonic ways are to either use `resample` (when it provides the functionality you need) or use a `TimeGrouper`: `df.groupby(pd.TimeGrouper(freq='M'))` – Karl D. Jun 06 '14 at 16:57
  • 13
    to get the result DataFrame sum or average, `df.groupby(pd.TimeGrouper(freq='M')).sum()` or `df.groupby(pd.TimeGrouper(freq='M')).mean()` – Alexandre Jan 28 '16 at 21:58
  • 16
    `pd.TimeGrouper` has been deprecated in favor of `pd.Grouper`, which is a bit more flexible but still takes `freq` and `level` arguments. – BallpointBen Dec 18 '18 at 01:18
  • the first method doesn't not appear to work. It gives the error, 'Series object has no attribute 'month'' for a Series created via `to_datetime`. – ely Sep 09 '19 at 20:36
  • 1
    @ely The answer implicitly relies on the lines in the original question where `b` is given an index after being read from CSV. Add `b.index = pd.to_datetime(b['date'],format='%m/%d/%y %I:%M%p')` after the line `b = pd.read_csv('b.dat')`. [I've edited the answer just now too.] – goodside Mar 05 '20 at 18:09
  • 1
    To group the months in chronological order, you need to swap the month and year index. The resulting command for the grouping being `b.groupby(by=[b.index.year, b.index.month])`. – MangoNrFive Oct 24 '22 at 17:36
116

(update: 2018)

Note that pd.Timegrouper is depreciated and will be removed. Use instead:

 df.groupby(pd.Grouper(freq='M'))
PandasRocks
  • 1,609
  • 1
  • 10
  • 12
  • 3
    Find the Grouper docs [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html?highlight=grouper) and the frequency specifications (`freq=...`) [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases). Some examples are `freq=D` for **days**, `freq=B` for **business days**, `freq=W` for **weeks** or even `freq=Q` for **quarters**. – Kim Mar 25 '20 at 12:09
  • 16
    I found it useful to use 'key' to avoid having to reindex the df, as follows: df.groupby(pd.Grouper(key='your_date_column', freq='M')) – Edward Oct 15 '20 at 15:00
  • Does this work if you're grouping by two columns, only one of which is datetime value column? – exlo Mar 17 '21 at 03:23
  • 2
    speeding up the further research for those who want to group by specific column (or more): df.groupby(['col1', pd.Grouper(key='date_col', freq='1M')]).agg({ 'col2': 'sum', 'col3': 'max' }) – Manaslu Apr 27 '21 at 08:09
29

To groupby time-series data you can use the method resample. For example, to groupby by month:

df.resample(rule='M', on='date')['Values'].sum()

The list with offset aliases you can find here.

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
18

One solution which avoids MultiIndex is to create a new datetime column setting day = 1. Then group by this column.

Normalise day of month

df = pd.DataFrame({'Date': pd.to_datetime(['2017-10-05', '2017-10-20', '2017-10-01', '2017-09-01']),
                   'Values': [5, 10, 15, 20]})

# normalize day to beginning of month, 4 alternative methods below
df['YearMonth'] = df['Date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
df['YearMonth'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day-1, unit='D')
df['YearMonth'] = df['Date'].map(lambda dt: dt.replace(day=1))
df['YearMonth'] = df['Date'].dt.normalize().map(pd.tseries.offsets.MonthBegin().rollback)

Then use groupby as normal:

g = df.groupby('YearMonth')

res = g['Values'].sum()

# YearMonth
# 2017-09-01    20
# 2017-10-01    30
# Name: Values, dtype: int64

Comparison with pd.Grouper

The subtle benefit of this solution is, unlike pd.Grouper, the grouper index is normalized to the beginning of each month rather than the end, and therefore you can easily extract groups via get_group:

some_group = g.get_group('2017-10-01')

Calculating the last day of October is slightly more cumbersome. pd.Grouper, as of v0.23, does support a convention parameter, but this is only applicable for a PeriodIndex grouper.

Comparison with string conversion

An alternative to the above idea is to convert to a string, e.g. convert datetime 2017-10-XX to string '2017-10'. However, this is not recommended since you lose all the efficiency benefits of a datetime series (stored internally as numerical data in a contiguous memory block) versus an object series of strings (stored as an array of pointers).

jpp
  • 159,742
  • 34
  • 281
  • 339
  • See this answer for the proper way to utilize offsets when there are already day=1 values : https://stackoverflow.com/a/45831333/9987623. – AlexK Nov 22 '20 at 04:08
  • @AlexK, does `pd.tseries.offsets` have an advantage over `pd.tseries.MonthBegin` ? – jpp Nov 22 '20 at 12:07
  • sorry, I don't know enough to tell those apart. I just added the comment because your `df['YearMonth'] = df['Date'] - pd.offsets.MonthBegin(1)` code above changes any date that is already the first of the month to the first of the previous month. – AlexK Nov 23 '20 at 01:27
  • @AlexK, Good spot, have updated answer accordingly. – jpp Nov 25 '20 at 17:59
  • For beginning of month with `Grouper`, see also https://stackoverflow.com/a/56280791/10495893 – Ben Reiniger Aug 05 '22 at 15:04
11

Slightly alternative solution to @jpp's but outputting a YearMonth string:

df['YearMonth'] = pd.to_datetime(df['Date']).apply(lambda x: '{year}-{month}'.format(year=x.year, month=x.month))

res = df.groupby('YearMonth')['Values'].sum()
tsando
  • 4,557
  • 2
  • 33
  • 35