1

I have this pandas dataframe with a datetime, an integer, and a string column

from io import StringIO
import pandas as pd
data1 =  """Year        N   X
            2008-01-01  2   A
            2008-01-01  3   B
            2008-01-01  6   C
            2008-01-01  2   D
            2010-01-01  7   A
            2010-01-01  1   B
            2010-01-01  8   C
            2012-01-01  9   A
            2012-01-01  4   B
            2016-01-01  1   A"""

df = pd.read_csv(StringIO(data1), delim_whitespace=True, parse_dates=["Year"])

I can aggregate column N for count, min, and max simply as:

df1 = df.groupby("X")["N"].agg(Count="count", Min="min", Max="max").reset_index()
print(df1)

   X  Count  Min  Max
0  A      4    1    9
1  B      3    1    4
2  C      2    6    8
3  D      1    2    2

Is there a way to achieve the same for column Year only displaying the year? I can do this in several steps:

g = df.groupby("X")["Year"]
df2 = g.agg(Count= "count").reset_index()
df2["Start_date"] = g.min().dt.year.values
df2["End_date"] = g.max().dt.year.values
print(df2)

   X  Count  Start_date  End_date
0  A      4        2008      2016
1  B      3        2008      2012
2  C      2        2008      2010
3  D      1        2008      2008

But a version similar to that above for Nsuch as

df2 = df.groupby("X")["Year"].agg(Count="count", Min="min().dt.year.values", Max="max().dt.year.values").reset_index()

obviously does not work. Is there a simpler way to aggregate the first and last year in pandas groupby (apart from the obvious approach to first extract min/max dates as above, then convert the datetime columns into year columns)?

Mr. T
  • 11,960
  • 10
  • 32
  • 54

2 Answers2

3
df = pd.read_csv(io.StringIO("""Year        N   X
            2008-01-01  2   A
            2008-01-01  3   B
            2008-01-01  6   C
            2008-01-01  2   D
            2010-01-01  7   A
            2010-01-01  1   B
            2010-01-01  8   C
            2012-01-01  9   A
            2012-01-01  4   B
            2016-01-01  1   A"""), sep="\s+")
df.Year = pd.to_datetime(df.Year)

df = df.groupby("X").agg(N=("N","count"),
                    Start_date=("Year",lambda f: min(f.dt.year)),
                    End_date=("Year",lambda f: max(f.dt.year)))

X N Start_date End_date
A 4 2008 2016
B 3 2008 2012
C 2 2008 2010
D 1 2008 2008
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
1

Have you tried using GroupBy.agg with named aggregation?

df.assign(Year=pd.to_datetime(df['Year']).dt.year).groupby('X').agg(
    N=('N', 'count'), Start_date=('Year', 'first'), End_date=('Year', 'last'),)

   N  Start_date  End_date
X                         
A  4        2008      2016
B  3        2008      2012
C  2        2008      2010
D  1        2008      2008

If the dates aren't ascending, use 'min' and 'max' instead of 'first' and 'last', respectively.

This way allows you to avoid using lambda expressions in the grouper (so this is quite performant). More on named aggregation can be found in my post here.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • I think you have the power to single-handedly close my question as a duplicate of a "named aggregation" question. It seems this is what it boils down to. – Mr. T Feb 14 '21 at 19:17
  • 1
    I personally think its fine to have more than one way to ask the same thing, especially if you've made an honest effort and are reaching out for help with the last mile. So I have no strong preference either way. But since you seem not to mind, and since the linked dupe has my answer anyway I think I'll go ahead and close it for ya :p – cs95 Feb 14 '21 at 19:26