Suppose we have a table of customers and their spending.
import pandas as pd
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9]
})
LIMIT = 6
For each customer, we can compute the fraction of their spending entries that exceed LIMIT using the apply method:
df.groupby("Name").apply(
lambda grp: len(grp[grp["Spend"] > LIMIT]) / len(grp)
)
Name
Alice 0.0
Bob 0.5
Charles 1.0
However, the apply method is essentially a Python-level loop over the groups, which is slow when there are many customers.
Question: Is there a faster way, presumably one that uses vectorization?
As of pandas 0.23.4, SeriesGroupBy does not support comparison operators:
(df.groupby("Name") ["Spend"] > LIMIT).mean()
TypeError: '>' not supported between instances of 'SeriesGroupBy' and 'int'
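(For what it's worth, doing the comparison on the column first and only then grouping the resulting boolean Series does run without error; I am not sure whether this is the idiomatic vectorized approach I am after:)
# compare first (vectorized), then group the boolean Series by the Name column
(df["Spend"] > LIMIT).groupby(df["Name"]).mean()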
The code below results in a null value for Alice, because none of her rows pass the filter, so her group is absent from the numerator:
df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()
Name
Alice NaN
Bob 0.5
Charles 1.0
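If one wanted to patch this attempt, the missing groups could presumably be filled with zero, since a group absent from the numerator simply has no spending above LIMIT, though this feels like a workaround rather than a fix:
# fill groups that are absent from the numerator with a fraction of 0
(df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()).fillna(0)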
The code below gives the correct result, but it requires us either to modify the table or to make a copy to avoid modifying the original.
df["Dummy"] = 1 * (df["Spend"] > LIMIT)
df.groupby("Name") ["Dummy"] .sum() / df.groupby("Name").size()