3

I'm new to Stack Overflow and can't comment yet, so I can't ask directly in the thread, but I'd like to clarify the solution in this question:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

I understand multi-index selection (level 0 vs. level 1), but I'm not clear on what each x in the lambda function refers to. To me, the x in x.sum() appears to refer to the level=0 grouping (summing all results within each group at level=0), whereas the x in 100 * x appears to refer to each individual result within the groupby object (not the level=0 grouping).

Sorry for such a basic question but an explanation would be very useful!

lczapski
jbachlombardo

2 Answers

5

This is the state_office DataFrame:

state_office
Out: 
                  sales
state office_id        
AZ    2          589661
      4          339834
      6          201054
CA    1          760950
      3          935865
      5          464993
CO    1          737207
      3          154900
      5          277555
WA    2          510215
      4          640508
      6          557411

If you group this on level=0, the groups will be:

                  sales
state office_id        
AZ    2          589661
      4          339834
      6          201054

                  sales
state office_id        
CA    1          760950
      3          935865
      5          464993

                  sales
state office_id        
CO    1          737207
      3          154900
      5          277555

                  sales
state office_id        
WA    2          510215
      4          640508
      6          557411

When you use groupby.apply with a custom function, these groups will be the inputs of this function (x in lambda x). I will use the term group instead of x to be more explicit.
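
To see those inputs for yourself, you can loop over the GroupBy object instead of calling apply. This is only a minimal sketch on a made-up frame (toy, idx and seen are names I introduced for illustration, not part of the original code), but each group it yields is the same kind of object the lambda receives as x:

```python
import pandas as pd

# Hypothetical toy data with the same (state, office_id) index layout.
idx = pd.MultiIndex.from_tuples(
    [("AZ", 2), ("AZ", 4), ("CA", 1), ("CA", 3)],
    names=["state", "office_id"],
)
toy = pd.DataFrame({"sales": [10, 30, 20, 60]}, index=idx)

seen = {}
for state, group in toy.groupby(level=0):
    # `group` is a whole sub-DataFrame: all rows that share this state.
    seen[state] = (type(group).__name__, group.shape, int(group["sales"].sum()))
    print(state, seen[state])
```

So x is never a single row or a single number; it is always the full block of rows for one state.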

The thing that's confusing you is called broadcasting. If for a particular group you use group / group.sum() then it will divide each element in that group by the sum. Let's take the first group:

                  sales
state office_id        
AZ    2          589661
      4          339834
      6          201054

group.sum() returns:

group.sum()
Out: 
sales    1130549
dtype: int64

Since the result has only one element, float(x.sum()) returns 1130549.0. A cleaner version selects the sales Series on the GroupBy object before applying the function: state_office.groupby(level=0)['sales'].apply(lambda x: 100 * x / x.sum()). Here x is a Series, so x.sum() is already a scalar and you don't need float(x.sum()).
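
A rough sketch of that Series-based variant, again on made-up data (toy and pcts are illustrative names). I pass group_keys=False, which is my own addition and not in the original snippet, because some pandas versions otherwise prepend the group key as an extra index level:

```python
import pandas as pd

# Hypothetical toy data with the same (state, office_id) index layout.
idx = pd.MultiIndex.from_tuples(
    [("AZ", 2), ("AZ", 4), ("CA", 1), ("CA", 3)],
    names=["state", "office_id"],
)
toy = pd.DataFrame({"sales": [10, 30, 20, 60]}, index=idx)

# Selecting ['sales'] first means each x is a Series, so x.sum() is a scalar.
pcts = (
    toy.groupby(level=0, group_keys=False)["sales"]
       .apply(lambda x: 100 * x / x.sum())
)
print(pcts)
```

Within each state the resulting percentages add up to 100.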

If you divide each element by this value, you get each office's share of its state total (multiply by 100 to get the percentage):

group / group.sum()
Out: 
                    sales
state office_id          
AZ    2          0.521570
      4          0.300592
      6          0.177837

At this point pandas/numpy figures out that the two operands don't have the same shape but can still be matched up, and stretches the smaller one to fit. Put more simply: if you passed three numbers, it would do element-wise division, but since you pass only one number, it knows that you want to divide each of the three numbers by that single number.
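
The same behaviour can be shown in plain NumPy with the AZ sales figures from above. This is just a sketch; the variable names and the [1, 2, 3] divisor are arbitrary choices of mine for contrast:

```python
import numpy as np

sales = np.array([589661, 339834, 201054])  # the AZ group from above
total = sales.sum()                         # a single number

# One number on the right: it is broadcast across all three elements.
shares = sales / total

# Three numbers on the right: plain element-wise division, no stretching needed.
elementwise = sales / np.array([1, 2, 3])

print(total)
print(shares)
print(elementwise)
```

The first division is what happens inside the lambda: every element of the group is divided by the same group total.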

ayhan
1

Let's read the documentation together. (Source)

GroupBy.apply(func, *args, **kwargs)

Apply function func group-wise and combine the results together.

Looking into func from the signature above:

func : function

A callable that takes a dataframe as its first argument, and returns a dataframe, a series or a scalar. In addition the callable may take positional and keyword arguments.

In the OP's example, lambda x: 100 * x / float(x.sum()) is the func from the documentation. According to the documentation, x here is a dataframe: one of the groups produced by the groupby call.
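
As a small illustration of the "returns a dataframe, a series or a scalar" part, here is a sketch on made-up data (toy, totals and pcts are names I chose; group_keys=False is my addition so the index layout stays the same across pandas versions):

```python
import pandas as pd

# Hypothetical toy data with the same (state, office_id) index layout.
idx = pd.MultiIndex.from_tuples(
    [("AZ", 2), ("AZ", 4), ("CA", 1), ("CA", 3)],
    names=["state", "office_id"],
)
toy = pd.DataFrame({"sales": [10, 30, 20, 60]}, index=idx)

grouped = toy.groupby(level=0, group_keys=False)

# func returns a scalar: the combined result has one value per state.
totals = grouped.apply(lambda x: int(x["sales"].sum()))

# func returns a DataFrame shaped like its input: the pieces are stitched back together.
pcts = grouped.apply(lambda x: 100 * x / x["sales"].sum())

print(totals)
print(pcts)
```

In both calls x is the sub-DataFrame for one state; only the return type of func changes what the combined result looks like.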

Tai