59

I have a dataframe that looks like:

data = {'index': ['2014-06-22 10:46:00', '2014-06-24 19:52:00', '2014-06-25 17:02:00', '2014-06-25 17:55:00', '2014-07-02 11:36:00', '2014-07-06 12:40:00', '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
    'type': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C'],
    'sum_col': [1, 2, 3, 1, 1, 3, 2, 1]}
df = pd.DataFrame(data, columns=['index', 'type', 'sum_col'])
df['index'] = pd.to_datetime(df['index'])
df = df.set_index('index')
df['weekofyear'] = df.index.weekofyear
df['date'] = df.index.date
df['date'] = pd.to_datetime(df['date'])



                     type sum_col weekofyear   date
index               
2014-06-22 10:46:00    A    1       25      2014-06-22
2014-06-24 19:52:00    B    2       26      2014-06-24
2014-06-25 17:02:00    C    3       26      2014-06-25
2014-06-25 17:55:00    A    1       26      2014-06-25
2014-07-02 11:36:00    B    1       27      2014-07-02
2014-07-06 12:40:00    C    3       27      2014-07-06
2014-07-05 12:46:00    A    2       27      2014-07-05
2014-07-27 15:12:00    C    1       30      2014-07-27

I'm looking to groupby the weekofyear, then sum up the sum_col. In addition, I need to find the earliest, and the latest date for the week. The first part is pretty easy:

gb = df.groupby(['type', 'weekofyear'])
gb['sum_col'].agg({'sum_col' : np.sum})

I've tried to find the min/max date with this, but haven't been successful:

gb = df.groupby(['type', 'weekofyear'])
gb.agg({'sum_col' : np.sum,
        'date' : np.min,
        'date' : np.max})

How would one find the earliest/latest date that appears?

DataSwede
  • 5,251
  • 10
  • 40
  • 66

3 Answers3

90

You need to combine the functions that apply to the same column, like this:

In [116]: gb.agg({'sum_col' : np.sum,
     ...:         'date' : [np.min, np.max]})
Out[116]: 
                      date             sum_col
                      amin       amax      sum
type weekofyear                               
A    25         2014-06-22 2014-06-22        1
     26         2014-06-25 2014-06-25        1
     27         2014-07-05 2014-07-05        2
B    26         2014-06-24 2014-06-24        2
     27         2014-07-02 2014-07-02        1
C    26         2014-06-25 2014-06-25        3
     27         2014-07-06 2014-07-06        3
     30         2014-07-27 2014-07-27        1
chrisb
  • 49,833
  • 8
  • 70
  • 70
  • 20
    To name the columns, supply a dictionary instead, like `gb.agg({'date': {'mindate': np.min, 'maxdate': np.max}})` – connorbode Feb 23 '17 at 22:08
  • 1
    Is above method support in pandas ? – CYC May 09 '20 at 02:10
  • 4
    @connorbode: it seems nested column naming is not supported in pandas v1.0.5 `pandas.core.base.SpecificationError: nested renamer is not supported` – tfad334 Sep 07 '20 at 03:46
  • 8
    `nested renamer is not supported` in newer version of pandas. Use `gb.date.agg(mindate=np.min, maxdate=np.max)` instead. – Ehsan Jan 19 '21 at 16:10
  • Here's a more flexible way to rename columns within the aggregate function: `agg(date_min=('date', 'min'), date_max=('date', 'max'))` – butterflyeffect Mar 07 '23 at 22:57
28

Simple code can be

df.groupby([key_field]).agg({'time_field': [np.min,np.max]})

where key_field here can be event_id and time_field can be timestamp field.

4

Another possible solution where you can have more control over the resulting column names:

gb = df.groupby(['type', 'weekofyear'])
gb.agg(
    sum_col=('sum_col', np.sum),
    first_date=('date', np.min),
    last_date=('date', np.max)
).reset_index()
Unai Sanchez
  • 496
  • 1
  • 6
  • 14
  • how would you able to keep rest of the column if dataframe had more column names? This method removes all columns right? – Sam Jul 20 '22 at 14:47