Consider the following synthetic example:
import pandas as pd
import numpy as np
np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='1H')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'cat': np.random.choice(['foo', 'bar'], size=ix.shape[0])
    },
    index=ix
)
This yields a table of the following form:
cat val
2017-01-01 00:00:00 bar 0.374540
2017-01-01 01:00:00 foo 0.950714
2017-01-01 02:00:00 bar 0.731994
2017-01-01 03:00:00 bar 0.598658
2017-01-01 04:00:00 bar 0.156019
Now, I want to count the number of instances and their average value per category and date.
The following groupby is almost perfect:
df.groupby(['cat',df.index.date]).agg({'val': ['count', 'mean']})
returning:
val
count mean
cat
bar 2017-01-01 16 0.437941
2017-01-02 16 0.456361
2017-01-03 9 0.514388...
The problem with this one is that the second level of the index is no longer a DatetimeIndex: the values come back as plain objects rather than dates. First question: why is this happening, and how can I avoid it?
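For what it's worth, inspecting the result on the same synthetic frame (rebuilt here so the snippet is self-contained; I use the lowercase 'h' frequency alias, which newer pandas versions prefer) suggests the second level holds plain Python `datetime.date` objects in an object-dtype index, rather than strings or datetimes:

```python
import datetime

import numpy as np
import pandas as pd

np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='h')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'cat': np.random.choice(['foo', 'bar'], size=ix.shape[0]),
    },
    index=ix,
)

grouped = df.groupby(['cat', df.index.date]).agg({'val': ['count', 'mean']})
level = grouped.index.get_level_values(1)

# df.index.date hands groupby a plain numpy array of datetime.date
# objects, so the resulting index level is object-dtype, not datetime64:
print(level.dtype)     # object
print(type(level[0]))  # <class 'datetime.date'>
```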
Next, I tried a combination of groupby and resample:
df.groupby('cat').resample('1d').agg({'val': 'mean'})
Here, the index is correct, but I fail to run both the mean and the count aggregation. This is the second question: why does
df.groupby('cat').resample('1d').agg({'val': ['mean', 'count']})
not work?
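Interestingly, the failure seems tied to the dict-of-lists form. On the pandas versions I tried, selecting the column first and passing the aggregations as a plain list goes through (a sketch against the same synthetic frame, not a guaranteed workaround across versions):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='h')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'cat': np.random.choice(['foo', 'bar'], size=ix.shape[0]),
    },
    index=ix,
)

# Select the column first, then pass the aggregations as a list;
# the second index level stays a real DatetimeIndex here:
out = df.groupby('cat').resample('1d')['val'].agg(['mean', 'count'])
print(out.head())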
Last question: what is a clean way to get a view aggregated with both functions, while keeping a date type for the index?
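One candidate I have been looking at is pd.Grouper, which lets groupby do the daily binning itself and appears to keep the second index level as datetime64, though I am not sure it is the idiomatic approach (sketch, with the synthetic frame rebuilt for self-containment):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
ix = pd.date_range('2017-01-01', '2017-01-15', freq='h')
df = pd.DataFrame(
    {
        'val': np.random.random(size=ix.shape[0]),
        'cat': np.random.choice(['foo', 'bar'], size=ix.shape[0]),
    },
    index=ix,
)

# pd.Grouper(freq='1d') bins the DatetimeIndex inside groupby, so both
# aggregations run and the second index level stays datetime64:
out = df.groupby(['cat', pd.Grouper(freq='1d')]).agg({'val': ['count', 'mean']})
print(out.head())
```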