Suppose we have a table of customers and their spending.

import pandas as pd
df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9]
})
LIMIT = 6

For each customer, we can compute the fraction of their purchases whose Spend exceeds LIMIT, using the apply method:

df.groupby("Name").apply(
    lambda grp: len(grp[grp["Spend"] > LIMIT]) / len(grp)
)

Name
Alice      0.0
Bob        0.5
Charles    1.0

However, apply is essentially a Python-level loop over the groups, which is slow when there are many customers.

Question: Is there a faster way, which presumably uses vectorization?

As of pandas 0.23.4, SeriesGroupBy does not support comparison operators:

(df.groupby("Name") ["Spend"] > LIMIT).mean()

TypeError: '>' not supported between instances of 'SeriesGroupBy' and 'int'

The code below results in a null value for Alice, because she has no rows with Spend above LIMIT, so her group is missing from the numerator:

df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()

Name
Alice      NaN
Bob        0.5
Charles    1.0

The code below gives the correct result, but it requires us either to modify the table or to make a copy to avoid modifying the original (a sketch using assign follows the code).

df["Dummy"] = 1 * (df["Spend"] > LIMIT)
df.groupby("Name") ["Dummy"] .sum() / df.groupby("Name").size()
  • I think you should be fine with `groupby`+`apply`. You are, however, doing a costly operation inside your lambda function. You don't need to build the mask, filter with it, and then use Python's built-in `len` to count the Trues. Just sum the mask directly: `df.groupby('Name').Spend.apply(lambda s: (s>LIMIT).sum()/s.size)` – rafaelc Dec 09 '18 at 06:19
  • Thank you very much for editing my poor English. – jezrael Dec 18 '18 at 07:50

1 Answer

Groupby does not use vectorization, but it has aggregate functions that are optimized with Cython.

You can build the boolean mask first and take its mean within each group:

(df["Spend"] > LIMIT).groupby(df["Name"]).mean()

df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()

Or use div to replace NaN with 0:

df[df["Spend"] > LIMIT].groupby("Name").size() \
.div(df.groupby("Name").size(), fill_value = 0)

df["Spend"].gt(LIMIT).groupby(df["Name"]).sum() \
.div(df.groupby("Name").size(), fill_value = 0)

Each of the above will yield

Name
Alice      0.0
Bob        0.5
Charles    1.0
dtype: float64

Performance

Performance depends on the number of rows and on how many rows match the condition, so it's best to test on real data.

import numpy as np

np.random.seed(123)

N = 100000
df = pd.DataFrame({
    "Name":  np.random.randint(1000, size = N),
    "Spend": np.random.randint(10, size = N)
})
LIMIT = 6

In [10]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
6.16 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit df[df["Spend"] > LIMIT].groupby("Name").size().div(df.groupby("Name").size(), fill_value = 0)
6.35 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).sum().div(df.groupby("Name").size(), fill_value = 0)
9.66 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# RafaelC comment solution
In [13]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).sum() / s.size)
400 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit df.groupby("Name")["Spend"].apply(lambda s: (s > LIMIT).mean())
328 ms ± 6.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This NumPy solution is vectorized, but a bit complicated:

In [15]: %%timeit
    ...: i, r = pd.factorize(df["Name"])
    ...: a = pd.Series(np.bincount(i), index = r)
    ...: 
    ...: i1, r1 = pd.factorize(df["Name"].values[df["Spend"].values > LIMIT])
    ...: b = pd.Series(np.bincount(i1), index = r1)
    ...: 
    ...: df1 = b.div(a, fill_value = 0)
    ...: 
5.05 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
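
Outside of %%timeit (which does not keep the variables it defines in the session), a minimal sketch of the same factorize/bincount idea, with a sanity check against the groupby mean, might look like this (variable names are just illustrative):

# Count rows per customer
i, r = pd.factorize(df["Name"])
total = pd.Series(np.bincount(i), index=r)

# Count rows per customer where Spend exceeds LIMIT
over = df["Spend"].values > LIMIT
i1, r1 = pd.factorize(df["Name"].values[over])
hits = pd.Series(np.bincount(i1), index=r1)

# Fraction per customer; fill_value=0 covers customers with no rows over LIMIT
frac = hits.div(total, fill_value=0)

# Sanity check against the Cython-optimized groupby mean
expected = df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
assert np.allclose(frac.sort_index().values, expected.sort_index().values)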