
I am trying to group by one column and compute value counts on another column.

import pandas as pd
dftest = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40, 10, 10, 40, 40, 40]})

dftest looks like

    A  Amt
0   1   20
1   1   20
2   1   20
3   1   30
4   1   30
5   1   30
6   1   30
7   1   40
8   1   40
9   2   10
10  2   10
11  2   40
12  2   40
13  2   40

Then I perform the grouping:

grouper = dftest.groupby('A')
df_grouped = grouper['Amt'].value_counts()

which gives

   A  Amt
1  30     4
   20     3
   40     2
2  40     3
   10     2
Name: Amt, dtype: int64

What I want is to keep the top two rows of each group.

Also, I was perplexed by an error when I tried to reset_index:

df_grouped.reset_index()

it gives the following error

ValueError: cannot insert Amt, already exists
muon

2 Answers


You need the parameter name in reset_index, because the Series name is the same as the name of one of the levels of the MultiIndex:

df_grouped.reset_index(name='count')

Another solution is to rename the Series:

print (df_grouped.rename('count').reset_index())

   A  Amt  count
0  1   30      4
1  1   20      3
2  1   40      2
3  2   40      3
4  2   10      2

A more common solution, instead of value_counts, is to aggregate with size:

df_grouped1 = dftest.groupby(['A','Amt']).size().reset_index(name='count')

print (df_grouped1)
   A  Amt  count
0  1   20      3
1  1   30      4
2  1   40      2
3  2   10      2
4  2   40      3
jezrael
  • perfect!! addresses the reset index issue... is there a better way to keep top n rows by group, count ... right now after trying a few things, only possible way that i can think of is first groupby.value_counts, then subset – muon Sep 29 '16 at 19:50
  • Maybe need [`nlargest`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nlargest.html) - `dftest.groupby(['A','Amt']).size().nlargest(3)` – jezrael Sep 29 '16 at 19:56
  • that does not do it by group, only gives overall nlargest – muon Sep 29 '16 at 20:02
  • 2
    you can apply nlargest to groupby, so a way could be to group again against your level 0: `df_grouped.groupby(level=0).nlargest(2)` – Zeugma Sep 29 '16 at 20:36
  • @Boud 's solution worked on the example mentioned. I am having trouble getting it working on multi-level-index, i might post as a separate question – muon Sep 30 '16 at 14:09
  • I think you can provide multiple levels on the level argument – Zeugma Sep 30 '16 at 14:47
  • 5
    "name" is depracted in newer version of pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html – Mermoz May 29 '18 at 10:24
  • 1
    One can also use `dftest.groupby(['A','Amt']).size().reset_index(name='count')` – Sheldore May 14 '20 at 10:53
  • `name` is a parameter of [Series.reset_index](https://pandas.pydata.org/docs/reference/api/pandas.Series.reset_index.html), not `DataFrame.reset_index` (which has a parameter `names`). `groupby` produces a `Series` (not a `DataFrame`) when there's only a single non-index column – nirvana-msu Feb 02 '23 at 12:03
  • Thanks! This saved me from issues trying to send a groupby object to mysql using pd.to_sql. Kept getting errors until I reindexed. – TASC Solutions May 31 '23 at 14:10
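Following up on the nlargest discussion in the comments: since groupby.value_counts already sorts counts in descending order within each group, groupby.head is another way to keep the top n rows per group. This is a sketch using the question's dftest, not part of the original answer:

```python
import pandas as pd

dftest = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40, 10, 10, 40, 40, 40]})

# value_counts sorts counts descending within each group of 'A'
counts = dftest.groupby('A')['Amt'].value_counts()

# keep the first two rows of each group, i.e. the two largest counts per group
top2 = counts.groupby(level=0).head(2)
```

Unlike nlargest, head preserves the original (A, Amt) MultiIndex without prepending an extra group-key level, which sidesteps the multi-level-index trouble mentioned above.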

To avoid reset_index altogether, groupby.size may be used with the as_index=False parameter (groupby.size produces the same output as value_counts; both drop NaNs by default anyway).

dftest.groupby(['A','Amt'], as_index=False).size()
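For reference, this returns a plain DataFrame with the count in a column named size. A quick check, assuming the same dftest as in the question:

```python
import pandas as pd

dftest = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40, 10, 10, 40, 40, 40]})

# as_index=False keeps A and Amt as regular columns; the count lands in 'size'
out = dftest.groupby(['A', 'Amt'], as_index=False).size()
```

The rows come back sorted by the group keys (A, then Amt), since groupby sorts keys by default.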

Since pandas 1.1, groupby.value_counts is redundant because value_counts() can be called directly on the DataFrame and produces the same output.

dftest.value_counts(['A', 'Amt']).reset_index(name='count')
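One caveat on "the same output": DataFrame.value_counts sorts by count across the whole frame, while the groupby version sorts within each group, so the two agree only up to row order. A sketch (assuming the question's dftest) comparing them after a sort_index:

```python
import pandas as pd

dftest = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40, 10, 10, 40, 40, 40]})

# the groupby route and the direct route agree once both are index-sorted
via_groupby = dftest.groupby('A')['Amt'].value_counts().sort_index()
direct = dftest.value_counts(['A', 'Amt']).sort_index()

same = (via_groupby.values == direct.values).all() and via_groupby.index.equals(direct.index)
```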

Since pandas 1.5, reset_index() accepts an allow_duplicates= parameter, which may be set to allow duplicate column names (as in the OP):

grouper = dftest.groupby('A')
grouper['Amt'].value_counts().reset_index(allow_duplicates=True)
cottontail