How to group pandas DataFrame entries by date in a non-unique column

Question

A Pandas DataFrame contains column named "date" that contains non-unique datetime values. I can group the lines in this frame using:

data.groupby(data['date'])

However, this splits the data by the datetime values. I would like to group these data by the year stored in the "date" column. This page shows how to group by year in cases where the time stamp is used as an index, which is not true in my case.

How do I achieve this grouping?

For those coming here in 2017+, there are several new ways to groupby a particular amount of time. [See this answer below](https://stackoverflow.com/a/47140458/3707607) — Ted Petrou, Nov 06 '17 at 15:35

score 132 · Accepted Answer · edited Dec 08 '16 at 19:58

132

I'm using pandas 0.16.2. This has better performance on my large dataset:

data.groupby(data.date.dt.year)

Using the dt option and playing around with weekofyear, dayofweek etc. becomes far easier.

edited Dec 08 '16 at 19:58

feetwet

3,248
7
46
84

answered Sep 25 '15 at 13:55

DACW

2,601
2
18
16

Concur, this seems to be the pandaic way of accessing date attributes for a series. – dancow May 24 '16 at 00:59

score 80 · Answer 2 · answered Jul 09 '12 at 14:25

80

ecatmur's solution will work fine. This will be better performance on large datasets, though:

data.groupby(data['date'].map(lambda x: x.year))

answered Jul 09 '12 at 14:25

Wes McKinney

101,437
32
142
108

9

Why map instead of apply? – Gus Sep 27 '15 at 01:01
1

Afaik, `map` usually has some good efficiency qualities when applying arbitrary functions as compared to just using `apply`. – Coolio2654 Apr 24 '18 at 15:35

score 43 · Answer 3 · answered Nov 06 '17 at 15:34

This might be easier to explain with a sample dataset.

Create Sample Data

Let's assume we have a single column of Timestamps, date and another column we would like to perform an aggregation on, a.

df = pd.DataFrame({'date':pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1', '2015-2-1', '2015-3-1']),
                   'a':[9,5,1,2,3]}, columns=['date', 'a'])

df

        date  a
0 2012-01-01  9
1 2012-06-01  5
2 2015-01-01  1
3 2015-02-01  2
4 2015-03-01  3

There are several ways to group by year

Use the dt accessor with year property
Put date in index and use anonymous function to access year
Use resample method
Convert to pandas Period

`.dt` accessor with `year` property

When you have a column (and not an index) of pandas Timestamps, you can access many more extra properties and methods with the dt accessor. For instance:

df['date'].dt.year

0    2012
1    2012
2    2015
3    2015
4    2015
Name: date, dtype: int64

We can use this to form our groups and calculate some aggregations on a particular column:

df.groupby(df['date'].dt.year)['a'].agg(['sum', 'mean', 'max'])

      sum  mean  max
date                
2012   14     7    9
2015    6     2    3

put date in index and use anonymous function to access year

If you set the date column as the index, it becomes a DateTimeIndex with the same properties and methods as the dt accessor gives normal columns

df1 = df.set_index('date')
df1.index.year

Int64Index([2012, 2012, 2015, 2015, 2015], dtype='int64', name='date')

Interestingly, when using the groupby method, you can pass it a function. This function will be implicitly passed the DataFrame's index. So, we can get the same result from above with the following:

df1.groupby(lambda x: x.year)['a'].agg(['sum', 'mean', 'max'])

      sum  mean  max
2012   14     7    9
2015    6     2    3

Use the `resample` method

If your date column is not in the index, you must specify the column with the on parameter. You also need to specify the offset alias as a string.

df.resample('AS', on='date')['a'].agg(['sum', 'mean', 'max'])

             sum  mean  max
date                       
2012-01-01  14.0   7.0  9.0
2013-01-01   NaN   NaN  NaN
2014-01-01   NaN   NaN  NaN
2015-01-01   6.0   2.0  3.0

Convert to pandas Period

You can also convert the date column to a pandas Period object. We must pass in the offset alias as a string to determine the length of the Period.

df['date'].dt.to_period('A')

0   2012
1   2012
2   2015
3   2015
4   2015
Name: date, dtype: object

We can then use this as a group

df.groupby(df['date'].dt.to_period('Y'))['a'].agg(['sum', 'mean', 'max'])


      sum  mean  max
2012   14     7    9
2015    6     2    3

In the last method where you use `to_period('A')`, what is that ('A') for? — shiv_90, Feb 06 '18 at 17:51
@Shiv_90 the `'A'` is a timeseries offset-alias: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases — ptim, Feb 07 '18 at 04:33
Which method would you recommend if one needs a separate 'date' column to be saved as well? For example, if I run the simple `.dt.year` method and save it in a new data frame, the dates are saved as indices and that becomes problematic if say I need to plot the data because the 'dates' column is not really there but only the three provided in `.agg()` — shiv_90, Feb 07 '18 at 10:53

score 15 · Answer 4 · answered Jul 09 '12 at 12:39

15

This should work:

data.groupby(lambda x: data['date'][x].year)

answered Jul 09 '12 at 12:39

ecatmur

152,476
27
293
366

score 1 · Answer 5 · answered Oct 08 '17 at 20:39

1

this will also work

data.groupby(data['date'].dt.year)

answered Oct 08 '17 at 20:39

Chetan Kabra

353
5
10

Should work but when executed it outputs the location of the object in the memory but no real output. `` is what I get when executed. – shiv_90 Feb 06 '18 at 17:49

Juan Paco Hernandez · Answer 6 · 2021-10-10T17:10:15.287

0

Use:

data.groupby(['col1', data.date.dt.year]).agg({'col2': 'agg_func'}).reset_index()

In case you want to group by year from datetime column and also by another different type column (col1)

edited Oct 10 '21 at 17:10

answered Oct 05 '21 at 21:43

Juan Paco Hernandez

56
4

How to group pandas DataFrame entries by date in a non-unique column

6 Answers6

`.dt` accessor with `year` property

put date in index and use anonymous function to access year

Use the `resample` method

Convert to pandas Period

Linked

Related

How to group pandas DataFrame entries by date in a non-unique column

6 Answers6

.dt accessor with year property

put date in index and use anonymous function to access year

Use the resample method

Convert to pandas Period

Linked

Related

`.dt` accessor with `year` property

Use the `resample` method