
Assume we have the following DataFrame:

import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n),
    }
)

If I now group by both columns:

gb = df.groupby(['cat','datetime']).sum()

I get the totals for each cat for each hour:

cat datetime            val
a   2011-01-01 00:00:00 1
    2011-01-01 09:00:00 3
    2011-01-02 16:00:00 1
    2011-01-03 16:00:00 1
b   2011-01-01 08:00:00 4
    2011-01-01 15:00:00 3
    2011-01-01 16:00:00 3
    2011-01-02 04:00:00 4
    2011-01-02 05:00:00 1
    2011-01-02 12:00:00 4

However, I would like to have something like:

cat datetime   val
a   2011-01-01 4
    2011-01-02 1
    2011-01-03 1
b   2011-01-01 10
    2011-01-02 9

I could get the desired result by adding another column called date:

df['date'] = df['datetime'].dt.date

and then doing a similar groupby: df.groupby(['cat','date']).sum(). But I am interested in whether there is a more Pythonic way to do it. In addition, I might want to look at the month or year level. So, what would be the right way?
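For reference, here is the date-column approach in full, together with the month/year variants I have in mind (sketched with the .dt accessor and .dt.to_period; I am not sure this is the most idiomatic way):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='h')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n),
    }
)

# day level: truncate each timestamp to its date
df['date'] = df['datetime'].dt.date
by_day = df.groupby(['cat', 'date'])['val'].sum()

# month / year level: collapse the timestamps to periods instead
by_month = df.groupby(['cat', df['datetime'].dt.to_period('M')])['val'].sum()
by_year = df.groupby(['cat', df['datetime'].dt.to_period('Y')])['val'].sum()
```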

Dror
  • Are you going to just filter or do you want to sum/resample? it may be better to split your date into year month day components and set this to the index so you can call `sum(level=[1,2])` for instance. Or to set the index to the date column, `resample` and then groupby on 'cat' and perform the aggregations – EdChum Mar 09 '16 at 15:43
  • It seems to me that the solution I suggested is the starting point of what you have in mind, but I don't understand how to put it together. – Dror Mar 09 '16 at 15:49

2 Answers


From your intermediate structure, you can use .unstack to separate the categories into columns, .resample to the daily frequency, and then .stack to get back to the original form:

In [126]: gb = df.groupby(['cat', 'datetime']).sum()

In [127]: gb.unstack(0)
Out[127]:
                     val
cat                    a    b
datetime
2011-01-01 00:00:00  1.0  NaN
2011-01-01 08:00:00  NaN  4.0
2011-01-01 09:00:00  3.0  NaN
2011-01-01 15:00:00  NaN  3.0
2011-01-01 16:00:00  NaN  3.0
2011-01-02 04:00:00  NaN  4.0
2011-01-02 05:00:00  NaN  1.0
2011-01-02 12:00:00  NaN  4.0
2011-01-02 16:00:00  1.0  NaN
2011-01-03 16:00:00  1.0  NaN

In [128]: gb.unstack(0).resample("D").sum().stack()
Out[128]:
                 val
datetime   cat
2011-01-01 a     4.0
           b    10.0
2011-01-02 a     1.0
           b     9.0
2011-01-03 a     1.0

EDIT: For other resampling frequencies (month, year, etc.), there is a good list of the offset aliases in the pandas resample documentation.
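For example, the same pipeline at month granularity might look like this (a sketch; "MS" is the month-start offset alias, and the setup just repeats the question's DataFrame):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='h')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n),
    }
)
gb = df.groupby(['cat', 'datetime']).sum()

# "MS" = month start; "YS" would give yearly totals the same way
monthly = gb.unstack(0).resample("MS").sum().stack()
```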

Randy

You can try set_index and then groupby on cat and the date part of the index:

import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=72, freq='H')
np.random.seed(10)
n = 10
df = pd.DataFrame(
    {
        "datetime": np.random.choice(rng, n),
        "cat": np.random.choice(['a', 'b', 'b'], n),
        "val": np.random.randint(0, 5, size=n),
    }
)
print(df)
  cat            datetime  val
0   a 2011-01-01 09:00:00    3
1   b 2011-01-01 15:00:00    3
2   a 2011-01-03 16:00:00    1
3   b 2011-01-02 04:00:00    4
4   b 2011-01-02 05:00:00    1
5   b 2011-01-01 08:00:00    4
6   a 2011-01-01 00:00:00    1
7   a 2011-01-02 16:00:00    1
8   b 2011-01-02 12:00:00    4
9   b 2011-01-01 16:00:00    3
df = df.set_index('datetime')
gb = df.groupby(['cat', lambda x: x.date()]).sum()
print(gb)
                val
cat                
a   2011-01-01    4
    2011-01-02    1
    2011-01-03    1
b   2011-01-01   10
    2011-01-02    9
jezrael