When aggregating data in Pandas I am able to pass strings like "count", "sum", "mean", etc. to aggregate data. Are there functions I can use instead of strings that would provide equivalent behavior? For example, if I use pd.Series.count instead of "count", the runtime takes a sizable hit.

import pandas as pd
import numpy as np

n = 10000000
df_nan = pd.DataFrame({"a": np.random.randint(0, 100, n*2),
                       "b": np.linspace(0, 100, n).tolist() + [None]*n})



%timeit df_nan.groupby("a").agg({"b": pd.Series.count})
1.63 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_nan.groupby("a").agg({"b": "count"})
479 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
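For reference, here is a minimal sketch of a sanity check that the spellings return the same counts. The tiny frame `small` is made up for illustration, and `groupby("a")["b"].count()` is the built-in method on the grouped column, included only for comparison. This checks equality of results and says nothing about speed.

```python
import pandas as pd

# Hypothetical toy data: group 0 has one non-null "b", group 1 has two.
small = pd.DataFrame({"a": [0, 0, 1, 1, 1],
                      "b": [1.0, None, 2.0, 3.0, None]})

# String alias passed to agg.
by_string = small.groupby("a").agg({"b": "count"})["b"].tolist()

# Function object passed to agg (the slower spelling in the timings above).
by_function = small.groupby("a").agg({"b": pd.Series.count})["b"].tolist()

# Built-in method on the grouped column.
by_method = small.groupby("a")["b"].count().tolist()

print(by_string, by_function, by_method)
```

All three skip the missing values in "b", so on this toy frame they should agree.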

Any idea what function I could pass instead?

Max Kanter
