Getting column mean in groupby clause python pandas

Question

I have a dataset of actors and directors, along with the popularity of each movie they worked on together.

print (actors_director_df.head(3))

                 actor         director  popularity counter
0          Chris Pratt  Colin Trevorrow   32.985763       0
1  Bryce Dallas Howard  Colin Trevorrow   32.985763       0
2          Irrfan Khan  Colin Trevorrow   32.985763       0

I want to group by actor and director, because a pair can work together on more than one film. I did that successfully with the query below.

actor_director_grouped = actors_director_df.groupby(['actor','director']) \
                         .size() \
                         .reset_index(name='count') \
                         .sort_values(['count'], ascending=False) \
                         .head(10)

print (actor_director_grouped)

                      actor            director  count
3619         Clint Eastwood      Clint Eastwood     14
19272           Woody Allen         Woody Allen     12
9606            Johnny Depp          Tim Burton      8

But the popularity column is missing from this DataFrame.

What I want is the mean of the popularity column for each group, shown alongside the actor, the director, and the count of movies they made together.

That is, my ideal output would look something like this:

                      actor            director  popularity count
3619         Clint Eastwood      Clint Eastwood   32.985763    14
19272           Woody Allen         Woody Allen   5.1231231    12
9606            Johnny Depp          Tim Burton   3.1231231     8

Probably want to use `agg` with `mean` for popularity and `sum` for count — user3483203, Jun 07 '18 at 18:15
Can you post a slightly larger sample of your dataframe as a dictionary that is easy to recreate? (Also that actually shows more groups) — user3483203, Jun 07 '18 at 18:16
Not a dup but much similar : https://stackoverflow.com/questions/38174155/group-dataframe-and-get-sum-and-count — harvpan, Jun 07 '18 at 18:17

Anton vBR · Accepted Answer · 2018-06-07T18:53:49.187

Looking at your dataframe, the counter column seems unnecessary. Let us instead use the popularity column to build one mean column and one count column:

import pandas as pd
import numpy as np

np.random.seed(444)

names = [
    'Robert Baratheon',
    'Jon Snow',
    'Daenerys Targaryen',
    'Theon Greyjoy',
    'Tyrion Lannister'
]

df = pd.DataFrame({
    'actor': np.random.choice(names, size=10, p = [0.2,0.2,0.2,0.1,0.3]),
    'director': np.random.choice(names, size=10, p = [0.4,0.1,0.1,0.1,0.3]),
    'popularity': np.random.randint(0,100, size=10),
    'counter': 0
})

df2 = df.groupby(['actor','director'])['popularity']\
        .agg(['count', 'mean'])\
        .reset_index()\
        .sort_values(by='mean', ascending=False)

print(df2)

Returns:

              actor          director  count  mean
0          Jon Snow  Robert Baratheon      2  53.5
5  Tyrion Lannister  Tyrion Lannister      2  49.0
2  Robert Baratheon  Tyrion Lannister      2  48.5
1  Robert Baratheon          Jon Snow      2  40.5
4     Theon Greyjoy  Tyrion Lannister      1  13.0
3     Theon Greyjoy  Robert Baratheon      1   7.0
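If the output columns should be called `popularity` and `count`, as in the question's ideal output, named aggregation is one option. This is only a sketch: it assumes pandas 0.25 or newer, and the sample frame below is hypothetical data invented for illustration.

```python
import pandas as pd

# Hypothetical sample data in the shape of the question's frame.
df = pd.DataFrame({
    'actor':      ['A', 'A', 'B', 'B', 'B', 'C'],
    'director':   ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'popularity': [10.0, 20.0, 5.0, 7.0, 9.0, 1.0],
})

summary = (
    df.groupby(['actor', 'director'])
      # Named aggregation: output_column=(input_column, function).
      # 'count' counts non-null popularity values in each group.
      .agg(popularity=('popularity', 'mean'),
           count=('popularity', 'count'))
      .reset_index()
      .sort_values('count', ascending=False)
)
print(summary)
```

Note that `'count'` skips missing values, so a pair whose popularity is NaN for some film would be counted lower than its number of rows.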

edited Jun 07 '18 at 18:53

answered Jun 07 '18 at 18:52


This is a much better answer. ~+1 – harvpan Jun 07 '18 at 19:03
@HarvIpan Thanks. Not 100% sure this is what OP wants, as it can sometimes be hard to interpret. You have something nice going in your answer too. – Anton vBR Jun 07 '18 at 19:06
I agree. Interpretation can be dubious sometimes. I specifically do not like the `.merge()` part of my answer. Your answer should be faster in that regard. – harvpan Jun 07 '18 at 19:09
Thanks guys, this is what I wanted. Now I have to understand the answer :D – Farooq Arshed Jun 07 '18 at 22:04
One last question: how do I get only the rows that have a count greater than 1? Doing `df2 = df2[count > 1]` is producing a lot of NaN in the mean and count columns. – Farooq Arshed Jun 07 '18 at 22:10
@FarooqArshed That is strange. Doing `df2 = df2[df2['count'] > 1]` should definitely work. Are you sure you typed it in like me? – Anton vBR Jun 08 '18 at 06:25
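A minimal sketch of the filter discussed in the comments above. The small `df2` here is a hypothetical stand-in with the same columns as the aggregated frame in the answer. One possible reading of the NaN problem is that a bare `count` refers to whatever Python variable has that name, not to the column, so the column has to be selected from the frame explicitly.

```python
import pandas as pd

# Hypothetical stand-in for the aggregated df2 from the answer.
df2 = pd.DataFrame({
    'actor':    ['Jon Snow', 'Theon Greyjoy', 'Tyrion Lannister'],
    'director': ['Robert Baratheon', 'Tyrion Lannister', 'Tyrion Lannister'],
    'count':    [2, 1, 2],
    'mean':     [53.5, 13.0, 49.0],
})

# Use bracket access: df2.count is the DataFrame.count method,
# not the 'count' column.
repeat_pairs = df2[df2['count'] > 1]
print(repeat_pairs)
```

The boolean mask is a Series with one value per row, so the result keeps whole rows and introduces no NaN values.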

harvpan · Answer 2 · 2018-06-07T18:44:17.643

I took the liberty of adding some dummy data to help illustrate the groupby clause.

print(df)

Output:

                   actor           director  popularity  counter
0           Chris Pratt    Colin Trevorrow   32.985763        0
1   Bryce Dallas Howard    Colin Trevorrow   32.985763        0
2           Irrfan Khan    Colin Trevorrow   32.985763        0
3           Irrfan Khan    Colin Trevorrow   60.000000       12
4           Irrfan Khan       John Markson   10.000000       10
5           Irrfan Khan       Mark Johnson  100.000000        4

Then group by actor and director, take the mean of popularity for each pair, and use the size of each group as the count.

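# as_index=False is left out here: on recent pandas versions, size() on such a
# groupby returns a DataFrame, and reset_index(name=...) then fails.
g = df.groupby(['actor', 'director'])
count = g.size().reset_index(name='count')    # number of rows per (actor, director) pair
grp = g['popularity'].mean().reset_index()    # mean popularity per pair
print(grp.merge(count))                       # joins on the shared actor and director columns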

Output:

                 actor         director  popularity  count
0  Bryce Dallas Howard  Colin Trevorrow   32.985763      1
1          Chris Pratt  Colin Trevorrow   32.985763      1
2          Irrfan Khan  Colin Trevorrow   46.492881      2
3          Irrfan Khan     John Markson   10.000000      1
4          Irrfan Khan     Mark Johnson  100.000000      1
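For anyone who wants to reproduce this, the dummy data above can be rebuilt from a dictionary. The sketch below also shows one way to get the same table without the merge step, by placing the two per-group results side by side with `pd.concat`. The variable name `result` is introduced here for illustration only.

```python
import pandas as pd

# The dummy data shown above, rebuilt as a dictionary.
df = pd.DataFrame({
    'actor': ['Chris Pratt', 'Bryce Dallas Howard', 'Irrfan Khan',
              'Irrfan Khan', 'Irrfan Khan', 'Irrfan Khan'],
    'director': ['Colin Trevorrow', 'Colin Trevorrow', 'Colin Trevorrow',
                 'Colin Trevorrow', 'John Markson', 'Mark Johnson'],
    'popularity': [32.985763, 32.985763, 32.985763, 60.0, 10.0, 100.0],
    'counter': [0, 0, 0, 12, 10, 4],
})

g = df.groupby(['actor', 'director'])

# Both Series share the same (actor, director) index, so concat aligns them.
result = pd.concat(
    [g['popularity'].mean(), g.size().rename('count')],
    axis=1,
).reset_index()
print(result)
```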

@chrisz, the counter are `0s` in OP's question. I have kept them as-is. — harvpan, Jun 07 '18 at 18:28
@HarvIpan I added the counter column in the df. I am new to this thing. but chrisz is right. I want to count the number of movies they did together along with the popularity mean. — Farooq Arshed, Jun 07 '18 at 18:37
