38

In the example from the pandas documentation about the new .pipe() method for GroupBy objects, an .apply() method accepting the same lambda would return the same results.

In [195]: import numpy as np

In [196]: n = 1000

In [197]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
   .....:                    'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
   .....:                    'Revenue': (np.random.random(n)*50+10).round(2),
   .....:                    'Quantity': np.random.randint(1, 10, size=n)})

In [199]: (df.groupby(['Store', 'Product'])
   .....:    .pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
   .....:    .unstack().round(2))

Out[199]: 
Product  Product_1  Product_2  Product_3
Store                                   
Store_1       6.93       6.82       7.15
Store_2       6.69       6.64       6.77

I can see how the pipe functionality differs from apply for DataFrame objects, but not for GroupBy objects. Does anyone have an explanation or examples of what can be done with pipe but not with apply for a GroupBy?

piRSquared
foglerit

2 Answers

76

pipe lets you pass a callable with the expectation that the object that called pipe is the object that gets passed to that callable.

With apply we assume that the object that calls apply has subcomponents that will each get passed to the callable that was passed to apply. In the context of a groupby the subcomponents are slices of the dataframe that called groupby where each slice is a dataframe itself. This is analogous for a series groupby.

The main difference in a groupby context is that with pipe the callable has the entire scope of the groupby object available to it. With apply, the callable only knows about the local slice.
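A minimal sketch of that difference, using the same small frame as the setup in this answer: it records the type of object each callable receives. The helper names `record_pipe` and `record_apply` are hypothetical and exist only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": list("XXXXYYYYYY"), "B": range(10)})
grouped = df.groupby("A")["B"]

seen_by_pipe = []
seen_by_apply = []

def record_pipe(g):
    # Called once, with the whole SeriesGroupBy object.
    seen_by_pipe.append(type(g).__name__)
    return g.sum()

def record_apply(s):
    # Called per group, with that group's Series slice.
    seen_by_apply.append(type(s).__name__)
    return s.sum()

pipe_result = grouped.pipe(record_pipe)
apply_result = grouped.apply(record_apply)

print(seen_by_pipe)        # type name of the object pipe handed over
print(set(seen_by_apply))  # type name(s) of the objects apply handed over
```

Both calls produce the same per-group sums here; what changes is what the callable gets to see while computing them.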

Setup
Consider df

df = pd.DataFrame(dict(
    A=list('XXXXYYYYYY'),
    B=range(10)
))

   A  B
0  X  0
1  X  1
2  X  2
3  X  3
4  Y  4
5  Y  5
6  Y  6
7  Y  7
8  Y  8
9  Y  9

Example 1
Make the entire 'B' column sum to 1 while each sub-group sums to the same amount. This requires the calculation to be aware of how many groups exist. The callable passed to apply only ever sees one group at a time, so on its own it has no way of knowing how many groups exist.

s = df.groupby('A').B.pipe(lambda g: df.B / g.transform('sum') / g.ngroups)
s

0    0.000000
1    0.083333
2    0.166667
3    0.250000
4    0.051282
5    0.064103
6    0.076923
7    0.089744
8    0.102564
9    0.115385
Name: B, dtype: float64

Note:

s.sum()

0.99999999999999989

And:

s.groupby(df.A).sum()

A
X    0.5
Y    0.5
Name: B, dtype: float64
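The same calculation can be written with a named function and checked programmatically. This is only a sketch: the function name `equal_weight_shares` is hypothetical, and like the lambda above it reads `df` from the enclosing scope.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("XXXXYYYYYY"), "B": range(10)})

def equal_weight_shares(g):
    # g is the whole SeriesGroupBy, so both the per-group totals
    # (g.transform('sum')) and the number of groups (g.ngroups) are available.
    return df["B"] / g.transform("sum") / g.ngroups

s = df.groupby("A")["B"].pipe(equal_weight_shares)

# The whole column sums to 1 (up to floating-point error) ...
print(np.isclose(s.sum(), 1.0))
# ... and each group contributes an equal share.
print(s.groupby(df["A"]).sum())
```

Comparing with `np.isclose` rather than `==` sidesteps the floating-point error visible in the `0.99999999999999989` output above.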

Example 2
Subtract the mean of one group from the values of another. Again, this can't be done with apply because apply doesn't know about other groups.

# Series.append was removed in pandas 2.0; pd.concat gives the same result
df.groupby('A').B.pipe(
    lambda g: pd.concat([
        g.get_group('X') - g.get_group('Y').mean(),
        g.get_group('Y') - g.get_group('X').mean()
    ])
)

0   -6.5
1   -5.5
2   -4.5
3   -3.5
4    2.5
5    3.5
6    4.5
7    5.5
8    6.5
9    7.5
Name: B, dtype: float64
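Because pipe forwards any extra positional and keyword arguments to the callable, the group labels do not have to be hard-coded. The sketch below assumes a hypothetical helper, `subtract_other_mean`, and uses `pd.concat` to combine the two pieces.

```python
import pandas as pd

df = pd.DataFrame({"A": list("XXXXYYYYYY"), "B": range(10)})

def subtract_other_mean(g, first, second):
    """Hypothetical helper: shift each group's values by the other group's mean."""
    return pd.concat([
        g.get_group(first) - g.get_group(second).mean(),
        g.get_group(second) - g.get_group(first).mean(),
    ])

# "X" and "Y" are passed through pipe to the helper as `first` and `second`.
result = df.groupby("A")["B"].pipe(subtract_other_mean, "X", "Y")
print(result)
```

This keeps the cross-group logic in one reusable function while still fitting into a method chain.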
piRSquared
  • Please correct me if I'm wrong but pipe always wins over apply also in the respect that pipe is using vectorizing under the hood while apply doesn't. This shortens the operation time significantly. – Andrew Anderson Apr 02 '23 at 11:48
-4
print(df.groupby(['A'])['B'].apply(lambda l: l/l.sum()/df.A.nunique()))
  • 3
    This answer was flagged as [Low Quality](https://stackoverflow.com/help/review-low-quality), and could benefit from an explanation. Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). Code only answers are **not considered good answers**, and are likely to be downvoted and/or deleted because they are **less useful** to a community of learners. It's only obvious to you. Explain what it does, and how it's different / **better** than existing answers. – Trenton McKinney May 28 '22 at 01:48