
I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:

import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').count()

When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').

list(freq)
Out[117]: ['color', 'size']

When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:

freq
Out[114]: 
    color  size
id             
1       1     1
2       3     3
3       1     1
4       2     2

I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?

davideps

1 Answer


count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a column, use the as_index=False parameter or reset_index():

freq = df.groupby('id', as_index=False).count()
print (freq)
   id  color  size
0   1      1     1
1   2      3     3
2   3      1     1
3   4      2     2
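
The reset_index() route mentioned above gives the same result: count first (which puts id into the index), then move id back into a regular column. A minimal sketch with the question's data:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# count() leaves 'id' in the index; reset_index() turns it back
# into an ordinary column, matching the as_index=False output.
freq = df.groupby('id').count().reset_index()
print(freq)
```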

So if we add a NaN to a column, the counts differ:

import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4], 
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

freq = df.groupby('id', as_index=False).count()
print (freq)
   id  color  size
0   1      1     0
1   2      3     3
2   3      1     1
3   4      2     2

You can also specify a column to count:

freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
   id  color
0   1      1
1   2      3
2   3      1
3   4      2
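
If you want the result column named freq as in the question, a minimal sketch using named aggregation (available since pandas 0.25) over the original data:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# Named aggregation: count the 'color' column per group and call
# the resulting column 'freq' directly, no rename needed.
freq = df.groupby('id', as_index=False).agg(freq=('color', 'count'))
print(freq)
```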

If you need a count that includes NaNs, use size:

freq = df.groupby('id').size().reset_index(name='count')
print (freq)
   id  count
0   1      1
1   2      3
2   3      1
3   4      2

The output is the same even with a NaN in size, because size counts rows regardless of missing values:

d = {'id': [1, 2, 3, 4, 2, 2, 4], 
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

freq = df.groupby('id').size().reset_index(name='count')
print (freq)
   id  count
0   1      1
1   2      3
2   3      1
3   4      2
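
The question mentions filtering by the frequency column. One way that avoids a merge is transform('size'), which broadcasts each group's row count back onto the original rows; a minimal sketch with a hypothetical threshold of 2:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# transform('size') returns a Series aligned with df, holding each
# row's group size; use it as a boolean mask to keep frequent ids.
filtered = df[df.groupby('id')['id'].transform('size') >= 2]
print(filtered)
```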

Thanks to Bharath for pointing out another solution with value_counts; the differences are explained here:

freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
   id  freq
0   2     3
1   4     2
2   3     1
3   1     1
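
Note that value_counts orders by descending count, as the output above shows. A minimal sketch of sorting it by id to match the groupby outputs:

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)

# value_counts sorts by count descending; sort_values('id') restores
# the id order produced by groupby().size().
freq = (df['id'].value_counts()
          .rename_axis('id')
          .to_frame('freq')
          .reset_index()
          .sort_values('id')
          .reset_index(drop=True))
print(freq)
```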
jezrael