In pandas, I want to group by two columns and then do a count. After that I want to drop one of the columns I grouped on and calculate the average of the counts. I know how to do this easily in SQL, but I run into issues in pandas: I can't drop one of the columns I grouped on previously. Does anyone know how to do this nicely?

(I just care about the end result; the procedure doesn't have to be this way if there is a better one.)

Eg:

Name, City
Anna, New York
Carl, New York
Carl, New York
Steven, London
Carl, London
Anna, Paris
Carl, Paris
Carl, Paris

Group by 'Name' and 'City', then count:

Name, City, Count
Anna, New York, 1
Carl, New York, 2
Steven, London, 1
Carl, London, 1
Anna, Paris, 1
Carl, Paris, 2

Drop 'City' and take the average count for each 'Name':

Name, Count
Anna, 1
Carl, 1.66667
Steven, 1
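
For reference, a minimal DataFrame with this sample data (just two plain string columns, assuming that's how the table above maps onto a DataFrame) would be:

import pandas as pd

# sample data matching the table above
df = pd.DataFrame({
    'Name': ['Anna', 'Carl', 'Carl', 'Steven', 'Carl', 'Anna', 'Carl', 'Carl'],
    'City': ['New York', 'New York', 'New York', 'London', 'London',
             'Paris', 'Paris', 'Paris'],
})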

2 Answers

Use GroupBy.size to get a MultiIndex Series, then group by the first level (Name) and aggregate with mean; last, create a DataFrame with Series.reset_index:

df1 = (df.groupby(['Name','City']).size()
         .groupby(level=0).mean()
         .reset_index(name='Count'))
print(df1)
     Name     Count
0    Anna  1.000000
1    Carl  1.666667
2  Steven  1.000000
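
The intermediate step here is a MultiIndex Series of per-(Name, City) counts, which is then averaged within each Name. A quick sketch of that intermediate result, assuming the sample df from the question:

counts = df.groupby(['Name','City']).size()
print(counts)
Name    City
Anna    New York    1
        Paris       1
Carl    London      1
        New York    2
        Paris       2
Steven  London      1
dtype: int64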
jezrael

Your operation is equivalent to computing a crosstab and getting the mean per row of the non-zero values:

out = (pd.crosstab(df['Name'], df['City'])
         .replace(0, float('nan')).mean(axis=1)
         .reset_index(name='Count') # optional
      )

output:

     Name     Count
0    Anna  1.000000
1    Carl  1.666667
2  Steven  1.000000
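
For reference, the intermediate crosstab looks like this (a sketch, using the same sample df); the zeros mark Name/City combinations that never occur, which is why they are replaced with NaN before taking the row-wise mean:

print(pd.crosstab(df['Name'], df['City']))
City    London  New York  Paris
Name
Anna         0         1      1
Carl         1         2      2
Steven       1         0      0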
mozway