Using pandas v1.1.0.

The pandas docs have a nice example of how to use numba to speed up a rolling.apply() operation:

import pandas as pd
import numpy as np

def mad(x):
    return np.fabs(x - x.mean()).mean()

df = pd.DataFrame({"A": np.random.randn(100_000)},
                  index=pd.date_range('1/1/2000', periods=100_000, freq='T')
).cumsum()

df.rolling(10).apply(mad, engine="numba", raw=True)

I would like to adapt it to work for a groupby operation:

df['day'] = df.index.day
df.groupby('day').agg(mad)

works fine.

But

df.groupby('day').agg(mad, engine='numba')

errors and gives

---------------------------------------------------------------------------
NumbaUtilError                            Traceback (most recent call last)
<ipython-input-21-ee23f1eec685> in <module>
----> 1 df.groupby('day').agg(mad, engine='numba')

~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    939 
    940         if maybe_use_numba(engine):
--> 941             return self._python_agg_general(
    942                 func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
    943             )

~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\groupby.py in _python_agg_general(self, func, engine, engine_kwargs, *args, **kwargs)
   1068 
   1069             if maybe_use_numba(engine):
-> 1070                 result, counts = self.grouper.agg_series(
   1071                     obj,
   1072                     func,

~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\ops.py in agg_series(self, obj, func, engine, engine_kwargs, *args, **kwargs)
    623 
    624         if maybe_use_numba(engine):
--> 625             return self._aggregate_series_pure_python(
    626                 obj, func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
    627             )

~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\groupby\ops.py in _aggregate_series_pure_python(self, obj, func, engine, engine_kwargs, *args, **kwargs)
    681 
    682         if maybe_use_numba(engine):
--> 683             numba_func, cache_key = generate_numba_func(
    684                 func, engine_kwargs, kwargs, "groupby_agg"
    685             )

~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\util\numba_.py in generate_numba_func(func, engine_kwargs, kwargs, cache_key_str)
    215     nopython, nogil, parallel = get_jit_arguments(engine_kwargs)
    216     check_kwargs_and_nopython(kwargs, nopython)
--> 217     validate_udf(func)
    218     cache_key = (func, cache_key_str)
    219     numba_func = NUMBA_FUNC_CACHE.get(

~\AppData\Local\Continuum\anaconda3\envs\ds-cit-dev\lib\site-packages\pandas\core\util\numba_.py in validate_udf(func)
    177         or udf_signature[:min_number_args] != expected_args
    178     ):
--> 179         raise NumbaUtilError(
    180             f"The first {min_number_args} arguments to {func.__name__} must be "
    181             f"{expected_args}"

NumbaUtilError: The first 2 arguments to mad must be ['values', 'index']

I'm guessing with engine=numba it expects the data to be slightly different.

tupui
Ray Bell
  • Original link of pandas doc is down. Here is the new link [Numba (JIT compilation)](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#numba-jit-compilation) – dyang Oct 25 '22 at 00:13

3 Answers

Have you tried Bodo for this? It is built on top of Numba and supports pandas directly. After installing it with pip install bodo, you can write, for example:

import pandas as pd
import numpy as np
import bodo

def mad(x):
    return np.fabs(x - x.mean()).mean()

np.random.seed(0)
df = pd.DataFrame({"A": np.random.randn(100_000)},
                  index=pd.date_range('1/1/2000', periods=100_000, freq='T')
).cumsum()

@bodo.jit(distributed=False)
def f(df):
    df['day'] = df.index.day
    df2 = df.groupby('day').agg(mad)
    return df2

df2 = f(df)
print(df2)

This example is probably too small to benefit from compilation, but the approach may help with your real use case.

Ehsan

Had this problem myself. Apparently, to use the numba engine in pandas you have to write custom functions with the signature f(values, index).

As per the documentation (GroupBy.transform):

If the 'numba' engine is chosen, the function must be a user defined function with values and index as the first and second arguments respectively in the function signature. Each group’s index will be passed to the user defined function and optionally available for use.

I had a simple function f(x) returning an int that I wanted to use inside a groupby. All it took to make it work with numba was amending the function to be f(values, index) so that the numba routine would have a valid parameter to pass the index to the function.

Previous function (works fine, but not with numba):

def equal_weight(arr) -> float:
    '''
    Returns a float of 1/n, where 'n' is the number of rows.
    '''
    return 1 / len(arr)

New function, compatible with the numba engine:

def equal_weight(values, index) -> float:
    '''
    Returns a float of 1/n, where 'n' is the number of rows.
    The index argument is required by the numba engine but unused here.
    '''
    return 1 / len(values)
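
Applied to the mad function from the question, the same change might look like the sketch below. It is only an illustration under a few assumptions: the small frame with an integer 'day' column is made-up data standing in for the question's frame, the aggregation is done on the single column 'A', and the numba call is skipped if numba is not installed. The lambda is just a way to get a reference result from the default engine for comparison.

```python
import numpy as np
import pandas as pd


def mad(values, index):
    # values: 1-D NumPy array holding one group's data.
    # index: that group's index; required by the numba engine, unused here.
    return np.fabs(values - values.mean()).mean()


# Made-up data: three groups of four rows each.
df = pd.DataFrame({
    "A": np.arange(12, dtype="float64"),
    "day": np.repeat([1, 2, 3], 4),
})

# Reference result with the default engine: adapt the two-argument
# function by passing the group's values and index explicitly.
expected = df.groupby("day")["A"].agg(
    lambda s: mad(s.to_numpy(), s.index.to_numpy())
)

try:
    import numba  # noqa: F401  (only needed for engine="numba")
    result = df.groupby("day")["A"].agg(mad, engine="numba")
except ImportError:
    # numba is not available; fall back to the default-engine result.
    result = expected

print(result)
```

If numba is installed, both paths should give the same per-group values; the first numba call also pays the compilation cost, so any speed-up only shows on larger data or repeated calls.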
Caio Castro

Try this:

df.groupby('day').agg(mad(df.groupby('day')), engine='numba')

I'm not sure about it, but the error says the first 2 arguments must be ['values', 'index'], so I think it will work with the data frame.

Ammar
  • Tried `g = df.groupby('day')` then `g.agg(mad(g), engine='numba')` and got `Traceback ... ValueError: Unable to coerce to Series, length must be 1: given 31` – Ray Bell Aug 04 '20 at 21:16