Simplified pandas groupby aggregation for datetime column

Question

I have this pandas dataframe with a datetime, an integer, and a string column

from io import StringIO
import pandas as pd
data1 =  """Year        N   X
            2008-01-01  2   A
            2008-01-01  3   B
            2008-01-01  6   C
            2008-01-01  2   D
            2010-01-01  7   A
            2010-01-01  1   B
            2010-01-01  8   C
            2012-01-01  9   A
            2012-01-01  4   B
            2016-01-01  1   A"""

df = pd.read_csv(StringIO(data1), delim_whitespace=True, parse_dates=["Year"])

I can aggregate column N for count, min, and max simply as:

df1 = df.groupby("X")["N"].agg(Count="count", Min="min", Max="max").reset_index()
print(df1)

   X  Count  Min  Max
0  A      4    1    9
1  B      3    1    4
2  C      2    6    8
3  D      1    2    2

Is there a way to achieve the same for column Year only displaying the year? I can do this in several steps:

g = df.groupby("X")["Year"]
df2 = g.agg(Count= "count").reset_index()
df2["Start_date"] = g.min().dt.year.values
df2["End_date"] = g.max().dt.year.values
print(df2)

   X  Count  Start_date  End_date
0  A      4        2008      2016
1  B      3        2008      2012
2  C      2        2008      2010
3  D      1        2008      2008

But a version similar to that above for Nsuch as

df2 = df.groupby("X")["Year"].agg(Count="count", Min="min().dt.year.values", Max="max().dt.year.values").reset_index()

obviously does not work. Is there a simpler way to aggregate the first and last year in pandas groupby (apart from the obvious approach to first extract min/max dates as above, then convert the datetime columns into year columns)?

score 3 · Answer 1 · answered Feb 14 '21 at 17:08

named aggregations
lambda function on date series

df = pd.read_csv(io.StringIO("""Year        N   X
            2008-01-01  2   A
            2008-01-01  3   B
            2008-01-01  6   C
            2008-01-01  2   D
            2010-01-01  7   A
            2010-01-01  1   B
            2010-01-01  8   C
            2012-01-01  9   A
            2012-01-01  4   B
            2016-01-01  1   A"""), sep="\s+")
df.Year = pd.to_datetime(df.Year)

df = df.groupby("X").agg(N=("N","count"),
                    Start_date=("Year",lambda f: min(f.dt.year)),
                    End_date=("Year",lambda f: max(f.dt.year)))

X	N	Start_date	End_date
A	4	2008	2016
B	3	2008	2012
C	2	2008	2010
D	1	2008	2008

Nice. Never encountered named aggregation before. – Mr. T Feb 14 '21 at 17:12 — Mr. T, Feb 14 '21 at 17:12

score 1 · Accepted Answer · answered Feb 14 '21 at 19:00

1

Have you tried using GroupBy.agg with named aggregation?

df.assign(Year=pd.to_datetime(df['Year']).dt.year).groupby('X').agg(
    N=('N', 'count'), Start_date=('Year', 'first'), End_date=('Year', 'last'),)

   N  Start_date  End_date
X                         
A  4        2008      2016
B  3        2008      2012
C  2        2008      2010
D  1        2008      2008

If the dates aren't ascending, use 'min' and 'max' instead of 'first' and 'last', respectively.

This way allows you to avoid using lambda expressions in the grouper (so this is quite performant). More on named aggregation can be found in my post here.

answered Feb 14 '21 at 19:00

cs95

379,657
97
704
746

I think you have the power to single-handedly close my question as a duplicate of a "named aggregation" question. It seems this is what it boils down to. – Mr. T Feb 14 '21 at 19:17
1

I personally think its fine to have more than one way to ask the same thing, especially if you've made an honest effort and are reaching out for help with the last mile. So I have no strong preference either way. But since you seem not to mind, and since the linked dupe has my answer anyway I think I'll go ahead and close it for ya :p – cs95 Feb 14 '21 at 19:26

Simplified pandas groupby aggregation for datetime column

2 Answers2

Linked