I haven't been able to find information on the most efficient (read: fastest) way of using user-defined functions in a groupby-apply setting in either Pandas or Numpy. I have run some tests of my own, but I am wondering whether there are other methods out there that I have not come across yet.
Take the following example DataFrame:
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([range(0, 100000), ["a", "b", "c"]], names=["time", "group"])
df = pd.DataFrame(columns=["value"], index=idx)
np.random.seed(12)
df["value"] = np.random.random(size=(len(idx),))
print(df.head())
               value
time group
0    a      0.154163
     b      0.740050
     c      0.263315
1    a      0.533739
     b      0.014575
I would like to calculate the percentage change over time per group (the function below is just an example; it could be any arbitrary user-defined function). I could do this in a pure Pandas implementation as follows:
def pct_change_pd(series, num):
    return series / series.shift(num) - 1
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
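(As an aside, I'm aware that for this particular example pandas has a built-in; I show it only for reference, since my question is about arbitrary UDFs where no such built-in exists.)
# built-in equivalent for this specific example only
out_builtin = df.sort_values(['group', 'time']).groupby("group")["value"].pct_change(periods=1)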
But I could also modify the function and apply it over a numpy array:
def shift_array(arr, num, fill_value=np.nan):
    # shift a 1-D array by `num` positions, padding with `fill_value`
    if num > 0:
        return np.concatenate((np.full(num, fill_value), arr[:-num]))
    elif num < 0:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))
    else:
        # num == 0: arr[:-0] is an empty slice, so return the array unchanged
        return arr.copy()
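A quick check of what the helper does (output as I'd expect it):
print(shift_array(np.array([1.0, 2.0, 3.0]), 1))   # [nan  1.  2.]
print(shift_array(np.array([1.0, 2.0, 3.0]), -1))  # [ 2.  3. nan]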
def pct_change_np(series, num):
    # `series` arrives as the per-group object from groupby-apply
    idx = series.index
    arr = series.values.flatten()
    arr_out = arr / shift_array(arr, num=num) - 1
    return pd.Series(arr_out, index=idx)
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)  # drop the duplicated "group" level that apply adds
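As a sanity check that the two variants agree: both outputs come from the same group-then-time ordering, so the raw values should line up element-wise (the leading NaNs per group compare equal here).
assert np.allclose(out_pd["value"].to_numpy(), out_np.to_numpy(), equal_nan=True)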
From my testing, it seems that the numpy method, even with its additional overhead of converting between np.array and pd.Series, is faster.
Pandas:
%%timeit
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
113 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy:
%%timeit
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
94.7 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As the index grows and the user-defined function becomes more complex, the Numpy implementation outperforms the Pandas implementation by an ever wider margin. However, I wonder if there are alternative methods to achieve similar results that are faster still. I'm specifically after another (more efficient) groupby-apply methodology that would let me work with any arbitrary user-defined function, not just the shown example of calculating the percentage change. I would be happy to hear if they exist!
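For context, here is a second, purely hypothetical UDF of the kind any proposed approach should handle (the rolling_zscore name and window parameter are my own invention, not from any library):
def rolling_zscore(series, window):
    # z-score of each value against the mean/std of the preceding `window` values
    arr = series.values.flatten()
    out = np.full(arr.shape, np.nan)
    for i in range(window, len(arr)):
        win = arr[i - window:i]
        out[i] = (arr[i] - win.mean()) / win.std()
    return pd.Series(out, index=series.index)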