              Count
League  Result         
EPL     H      16
        D      9
        A      10
Champ   H      67
        D      15
        A      57
        H      87
La Liga D      35
        A      40
        

I have a breakdown of football results for different leagues and a count of how many times that result occurred.

I want to see the proportion of home wins, draws, away wins as a percentage of the total games played. I have seen a solution to this below:

df.groupby("League").apply(lambda g: g / g.sum() * 100)

At first glance, this made sense, but what exactly is g here? I assumed it was the H, D or A count and then the g.sum() summed all of the H,D,A counts grouped by each division. But, if g is just a value, how are we calling the method g.sum()? What exactly is g here?

the man

2 Answers

g is a DataFrame. Because you group on 'League', the DataFrame is split into separate chunks, one per unique value of 'League', and each chunk is passed to your function as g. To illustrate this, we can iterate over the GroupBy object.

for idx, g in df.groupby('League'):  # `idx` is the unique group key
    print(g, '\n')

               Count
League Result       
Champ  H          67
       D          15
       A          57
       H          87

               Count
League Result       
EPL    H          16
       D           9
       A          10

                Count
League  Result       
La Liga D          35
        A          40

apply then calls your function on each of these DataFrames separately. Calling g.sum() gives you a Series holding the sum of each column within the group.

for idx, g in df.groupby('League'):
    print(g.sum(), '\n')

Count    226
dtype: int64 

Count    35
dtype: int64 

Count    75
dtype: int64 
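
To tie this back to the original expression, here is a minimal sketch that does by hand what the apply does: loop over the groups, compute g / g.sum() * 100 for each one, and glue the pieces back together. The way df is built here is an assumption (the question only shows the printed frame), and the names index, pieces and pct are made up for illustration.

```python
import pandas as pd

# Rebuild the frame from the question. The MultiIndex construction is an
# assumption; the question only shows the printed result.
index = pd.MultiIndex.from_tuples(
    [("EPL", "H"), ("EPL", "D"), ("EPL", "A"),
     ("Champ", "H"), ("Champ", "D"), ("Champ", "A"), ("Champ", "H"),
     ("La Liga", "D"), ("La Liga", "A")],
    names=["League", "Result"],
)
df = pd.DataFrame({"Count": [16, 9, 10, 67, 15, 57, 87, 35, 40]}, index=index)

pieces = []
for league, g in df.groupby("League"):
    # g is a DataFrame holding only this league's rows.
    # g.sum() is a Series indexed by column name ("Count"), so the division
    # aligns on columns and scales every row of the group by the group total.
    pieces.append(g / g.sum() * 100)

# Stitch the per-league results back into one frame (groups come out sorted by key).
pct = pd.concat(pieces)
print(pct.round(1))
```

Within each league the resulting percentages add up to 100, which is a quick way to check that the division used the group total and not the grand total.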
ALollz
  • @theman happy it helped! Since the `groupby` object can be a bit opaque, I found gaining intuition (and debugging) easiest to just iterate like the above. Conceptually, it's no different from what pandas does. That being said, when data get large pandas has optimized many of these operations so converting the straight-forward above code (or even your apply) into something a lot more performant that doesn't loop, like in YOBEN_S's solution, is generally preferred. – ALollz May 29 '20 at 19:04

We usually use transform for this:

df['Count'] = df['Count'] * 100 / df.groupby(level=0)['Count'].transform('sum')

g in your function is the dataframe

df.groupby(level=0).apply(lambda x: type(x))
Out[607]: 
League
Champ      <class 'pandas.core.frame.DataFrame'>
EPL        <class 'pandas.core.frame.DataFrame'>
La Liga    <class 'pandas.core.frame.DataFrame'>
dtype: object
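
For anyone wondering why the transform line above works, here is a small sketch under the same assumed reconstruction of the question's frame (the construction and the names league_total and pct are illustrative, not from the original post). transform('sum') returns a Series with the same index as the input, with each league's total repeated on every row of that league, so the division is a plain element-wise operation with no Python-level function call per group.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the question.
index = pd.MultiIndex.from_tuples(
    [("EPL", "H"), ("EPL", "D"), ("EPL", "A"),
     ("Champ", "H"), ("Champ", "D"), ("Champ", "A"), ("Champ", "H"),
     ("La Liga", "D"), ("La Liga", "A")],
    names=["League", "Result"],
)
df = pd.DataFrame({"Count": [16, 9, 10, 67, 15, 57, 87, 35, 40]}, index=index)

# Each league's total, broadcast back onto every row of that league,
# in the original row order.
league_total = df.groupby(level="League")["Count"].transform("sum")
print(league_total.tolist())  # [35, 35, 35, 226, 226, 226, 226, 75, 75]

# Element-wise division: each count as a percentage of its league's total.
pct = df["Count"] * 100 / league_total
print(pct.round(1))
```

Unlike the original one-liner, this sketch leaves df untouched and stores the percentages in a separate Series; assign it back to df['Count'] if you want to overwrite the counts.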
BENY
  • Where by "the dataframe" you mean one of the groups of the groupby. – Igor Rivin May 29 '20 at 18:41
  • @IgorRivin which, is a sub `DataFrame` of the group, basically. `.apply(lambda x: type(x))` makes it rather clear. It helps to know what `type` of object you're working with since the methods applied will differ greatly. – r.ook May 29 '20 at 18:43
  • @r.ook Yes, I know what a groupby object is. However, Yoben's answer may have confused the OP, since the latter probably does NOT know that. – Igor Rivin May 29 '20 at 18:48
  • Who is the "we" in the phrase, "we usually do a transform"? From where I sit, `apply` and `agg` are *much* more common – Paul H May 29 '20 at 18:56
  • @PaulH apply is common, but when you have a large dataframe the running time will increase a lot. Check https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code, hope you can be one of us. ~ – BENY May 29 '20 at 18:58