
I have a DataFrame with a two-level MultiIndex. The first level, `date`, is a DatetimeIndex and the second level, `name`, contains plain strings. The data is at 10-minute intervals.

How can I group by date on the first level of this MultiIndex and count the number of rows I have per day?
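
For reference, my data looks roughly like this (an illustrative construction, not my actual data; the sensor_a/sensor_b labels are just placeholders):

import numpy as np
import pandas as pd

dates = pd.date_range('2017-01-01', freq='10min', periods=432)   # three days of 10-minute timestamps
names = ['sensor_a', 'sensor_b'] * 216                           # placeholder second-level labels
data = pd.DataFrame({'value': np.random.rand(432)},
                    index=pd.MultiIndex.from_arrays([dates, names], names=['date', 'name']))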

I suspect that the DatetimeIndex being wrapped inside a MultiIndex is what's giving me problems, since doing

data.groupby(pd.TimeGrouper(freq='D')).count()

gives me

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'

I've also tried writing

data.groupby(data.index.levels[0].date).count()

which leads to

ValueError: Grouper and axis must be same length

How could I, for example, make the grouper longer (i.e., include the duplicate index values, the omission of which now makes it shorter than the axis)?
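
To illustrate what I mean by "longer": data.index.levels[0] seems to hold only the unique dates, while data.index.get_level_values(0) has one value per row (the same length as the axis), so I'm guessing that's the direction to look in:

print(len(data.index.levels[0]))            # number of unique dates in the first level
print(len(data.index.get_level_values(0)))  # one value per row, same length as the axis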

Thanks!

basse

2 Answers


You can use the `level` keyword in `Grouper`. (Also note that `TimeGrouper` is deprecated.) This parameter is

the level for the target index.

Example DataFrame:

import numpy as np
import pandas as pd

dates = pd.date_range('2017-01', freq='10MIN', periods=1000)
strs = ['aa'] * 1000
df = pd.DataFrame(np.random.rand(1000, 2), index=pd.MultiIndex.from_arrays((dates, strs)))

Solution:

print(df.groupby(pd.Grouper(freq='D', level=0)).count())
              0    1
2017-01-01  144  144
2017-01-02  144  144
2017-01-03  144  144
2017-01-04  144  144
2017-01-05  144  144
2017-01-06  144  144
2017-01-07  136  136

Update: you noted in your comments that your resulting counts have zeros you'd like to drop. For instance, say your DataFrame is actually missing some days:

df = df.drop(df.index[140:400])
print(df.groupby(pd.Grouper(freq='D', level=0)).count())
              0    1
2017-01-01  140  140
2017-01-02    0    0
2017-01-03   32   32
2017-01-04  144  144
2017-01-05  144  144
2017-01-06  144  144
2017-01-07  136  136

To my knowledge there's no way to exclude zero counts within .count. Instead, you could use your result from above to drop zeros.

A first solution (perhaps less preferable, because it converts the int result to float when np.nan is introduced) would be:

res = df.groupby(pd.Grouper(freq='D', level=0)).count()
res = res.replace(0, np.nan).dropna()
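
If integer counts are needed afterwards (introducing np.nan converts the result to float), appending a cast should work:

res = res.replace(0, np.nan).dropna().astype(int)   # same as above, with the int dtype restored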

A second and, in my opinion, better solution (applied to the unfiltered res = df.groupby(pd.Grouper(freq='D', level=0)).count()), from here:

res = res[(res.T != 0).any()]
print(res) # notice - excludes 2017-01-02
              0    1
2017-01-01  140  140
2017-01-03   32   32
2017-01-04  144  144
2017-01-05  144  144
2017-01-06  144  144
2017-01-07  136  136

`.any` comes from NumPy (it's also available on pandas objects) and returns True when any element is True over the requested axis.
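
A tiny standalone illustration of that behaviour (toy data, not tied to the example above):

import pandas as pd

tmp = pd.DataFrame({'a': [0, 1], 'b': [0, 0]}, index=['x', 'y'])
print((tmp.T != 0).any())   # x: False, y: True -- keeps rows where any value is nonzero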

Brad Solomon
  • Thanks, Brad, you answered my question perfectly. As a learning opportunity, I noticed that I get rows of zero counts and appending `.dropna()` to the `.groupby().count()` statement doesn't drop those. Any way to make the `Grouper` drop zero counts straight away in the same line? – basse Aug 04 '17 at 05:34

Assuming the DataFrame looks like this:

import pandas as pd

d = pd.DataFrame([['Mon','foo',3],['Tue','bar',6],['Wed','qux',9]],
                 columns=['date','name','amount'])\
                .set_index(['date','name'])

you can drop the `name` level from the index just for this grouping operation:

d.reset_index('name', drop=True)\
 .groupby('date')\
 ['amount'].count()
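
For what it's worth, grouping on the index level directly should also work without modifying the index (a sketch, assuming the level is named date as above):

d.groupby(level='date')['amount'].count()
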
Marcel Flygare