
I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:

import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').count()

When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').

list(freq)
Out[117]: ['color', 'size']

When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:

freq
Out[114]: 
    color  size
id             
1       1     1
2       3     3
3       1     1
4       2     2

I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?

davideps

1 Answer


count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a column, use the as_index=False parameter or reset_index():

freq = df.groupby('id', as_index=False).count()
print (freq)
   id  color  size
0   1      1     1
1   2      3     3
2   3      1     1
3   4      2     2
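
The reset_index() route mentioned above gives the same result: count first (which puts id into the index), then move id back into a regular column. A minimal sketch with the question's data:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# count() leaves 'id' in the index; reset_index() turns it back
# into an ordinary column, matching the as_index=False output.
freq = df.groupby('id').count().reset_index()
print(freq)
```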

So if we add a NaN to a column, the counts differ:

import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4], 
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

freq = df.groupby('id', as_index=False).count()
print (freq)
   id  color  size
0   1      1     0
1   2      3     3
2   3      1     1
3   4      2     2

You can also specify a column to count:

freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
   id  color
0   1      1
1   2      3
2   3      1
3   4      2
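
If you want the result column named freq as in the question, a minimal sketch using named aggregation (available since pandas 0.25) over the original data:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# Named aggregation: count the 'color' column per group and call
# the resulting column 'freq' directly, no rename needed.
freq = df.groupby('id', as_index=False).agg(freq=('color', 'count'))
print(freq)
```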

If you need a count that includes NaNs, use size:

freq = df.groupby('id').size().reset_index(name='count')
print (freq)
   id  count
0   1      1
1   2      3
2   3      1
3   4      2

The output is the same even with a NaN in size, because size counts rows regardless of missing values:

d = {'id': [1, 2, 3, 4, 2, 2, 4], 
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

freq = df.groupby('id').size().reset_index(name='count')
print (freq)
   id  count
0   1      1
1   2      3
2   3      1
3   4      2
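
The question mentions filtering by the frequency column. One way that avoids a merge is transform('size'), which broadcasts each group's row count back onto the original rows; a minimal sketch with a hypothetical threshold of 2:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# transform('size') returns a Series aligned with df, holding each
# row's group size; use it as a boolean mask to keep frequent ids.
filtered = df[df.groupby('id')['id'].transform('size') >= 2]
print(filtered)
```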

Thanks to Bharath for pointing out another solution with value_counts; the differences are explained here:

freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
   id  freq
0   2     3
1   4     2
2   3     1
3   1     1
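
Note that value_counts orders by descending count, as the output above shows. A minimal sketch of sorting it by id to match the groupby outputs:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# value_counts sorts by count descending; sort_values('id') restores
# the id order produced by groupby().size().
freq = (df['id'].value_counts()
          .rename_axis('id')
          .to_frame('freq')
          .reset_index()
          .sort_values('id')
          .reset_index(drop=True))
print(freq)
```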
jezrael