
I'm looking for the most efficient (read: fastest) way of using user-defined functions in a groupby-apply setting in either Pandas or NumPy. I have done some tests of my own, but I'm wondering whether there are other methods out there that I have not come across yet.

Take the following example DataFrame:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(0, 100000), ["a", "b", "c"]], names=["time", "group"])
df = pd.DataFrame(columns=["value"], index=idx)

np.random.seed(12)
df["value"] = np.random.random(size=(len(idx),))

print(df.head())

               value
time group          
0    a      0.154163
     b      0.740050
     c      0.263315
1    a      0.533739
     b      0.014575

As an example, I would like to calculate the percentage change over time per group (the calculation below could be any arbitrary user-defined function). I could do this in a pure Pandas implementation as follows:

def pct_change_pd(series, num):
    return series / series.shift(num) - 1

out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)

But I could also modify the function and apply it over a numpy array:

def shift_array(arr, num, fill_value=np.nan):
    # shift a 1-D array by num positions, padding the gap with fill_value
    if num == 0:
        return arr.copy()
    elif num > 0:
        return np.concatenate((np.full(num, fill_value), arr[:-num]))
    else:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))

def pct_change_np(series, num):
    idx = series.index

    arr = series.values.flatten()
    arr_out = arr / shift_array(arr, num=num) - 1
    return pd.Series(arr_out, index=idx)

out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)
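
For what it's worth, a quick sanity check that the two approaches give the same values (the comparison below is only illustrative; ravel() irons out the Series/DataFrame shape difference between the two results):

# both results hold the same numbers in the same order
assert np.allclose(out_pd.to_numpy().ravel(), np.asarray(out_np).ravel(), equal_nan=True)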

From my testing, it seems that the numpy method, even with its additional overhead of converting between np.array and pd.Series, is faster.

Pandas:

%%timeit
out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)

113 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy:

%%timeit
out_np = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_np, num=1)
out_np = out_np.reset_index(level=2, drop=True)

94.7 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As the index grows and the user-defined function becomes more complex, the NumPy implementation will keep outperforming the Pandas implementation by a growing margin. However, I wonder if there are alternative methods for achieving similar results that are even faster. I'm specifically after another (more efficient) groupby-apply methodology that works with any arbitrary user-defined function, not just the percentage-change example shown here. I'd be happy to hear if they exist!

Menno Van Dijk

2 Answers


Often the name of the game is to use whatever functions are already in the toolbox (often optimized and C-compiled) rather than applying your own pure-Python function. For example, one alternative would be:

def f1(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    return z / z.groupby('group', **grb_kwargs).transform(pd.Series.shift, num) - 1

That is about 32% faster than the .groupby('group').apply(pct_change_pd, num=1) version; on your system, it would come in at around 85 ms.
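
As a quick sanity check (illustrative only; out_apply and out_f1 are just placeholder names), the values agree with the apply-based result:

out_apply = df.sort_values(['group', 'time']).groupby('group').apply(pct_change_pd, num=1)
out_f1 = f1(df, num=1)
# the apply version carries an extra group level in its index, so compare the raw values
assert np.allclose(out_apply['value'].to_numpy(), out_f1['value'].to_numpy(), equal_nan=True)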

And then, there is the trick of doing your "expensive" calculation on the whole df, but masking out the parts that are spillovers from other groups:

def f2(df, num=1):
    grb_kwargs = dict(sort=False, group_keys=False)  # avoid redundant ops
    z = df.sort_values(['group', 'time'])
    z2 = z.shift(num)
    gid = z.groupby('group', **grb_kwargs).ngroup()
    z2.loc[gid != gid.shift(num)] = np.nan
    return z / z2 - 1

That one is fully 2.1x faster (on your system it would be around 52.8 ms).

Finally, when there is no vectorized function that you can use directly, you can use numba to speed up your code (which can then be written with loops to your heart's content)... A classic example is cumulative sum with caps, as in this SO post and this one.
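
For illustration, a minimal sketch of the numba route (assuming numba is installed; pct_change_nb and the variable names below are only illustrative, not code from this answer). The compiled function works on plain NumPy arrays rather than pandas objects, and the first call includes JIT-compilation time:

import numba

@numba.njit
def pct_change_nb(values, group_ids, num):
    # loop over rows; emit NaN whenever the lag would fall outside the row's group
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        j = i - num
        if j < 0 or group_ids[j] != group_ids[i]:
            out[i] = np.nan
        else:
            out[i] = values[i] / values[j] - 1
    return out

# usage: unpack the (sorted) frame into plain arrays, then wrap the result back into pandas
z = df.sort_values(['group', 'time'])
gid = z.groupby('group', sort=False).ngroup().to_numpy()
out_nb = pd.Series(pct_change_nb(z['value'].to_numpy(), gid, 1), index=z.index)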

Pierre D
  • Hi, thanks for the rather extensive answer! I'm fully aware that using built-in functionality will allow this specific use case to be faster, but calculating percentage change is only one of many user-defined functions that I would like to use. I was really trying to ask what efficient groupby-apply methodologies exist that accept **any** arbitrary user-defined function. Thank you for your answer nevertheless! – Menno Van Dijk Dec 27 '20 at 16:57
  • yep, no free lunch: if you're in Python territory, then you have the GIL and all kinds of things. In that case, numba is your friend (also terribly effective on GPUs), **but**: you still have to adapt your code and be careful (no pandas arguments, Series or DataFrame; sometimes finicky rules re. signatures, etc.). It is sometimes worth it, but if you can express your operation as vectorized ones instead, then that will be the clear winner. – Pierre D Dec 27 '20 at 17:05

Your first function, run through .apply(), gives me this result:

In [42]: %%timeit
    ...: out_pd = df.sort_values(['group', 'time']).groupby(["group"]).apply(pct_change_pd, num=1)
155 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using the groups of the groupby object dfg (see below), the time goes down to 56 ms.
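
Here dfg is the groupby object, built once outside the timed cell; its exact definition isn't shown, but it is presumably something along the lines of:

dfg = df.sort_values(['group', 'time']).groupby('group')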

%%timeit
num=1
outpd_list = []
for g in dfg.groups.keys():
    gc = dfg.get_group(g)
    outpd_list.append(gc['value'] / gc['value'].shift(num) - 1)
out_pd = pd.concat(outpd_list, axis=0)

56 ms ± 821 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

And if you change this one line in the above code to use the built-in function, you get a bit more time savings:

outpd_list.append(gc['value'].pct_change(num))
41.2 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Jonathan Leon