I'm trying to improve the performance of a pandas.groupby.aggregate operation that uses a custom aggregating function. I noticed that - correct me if I'm wrong - pandas calls the aggregating function on each group in sequence (I suspect it is a simple for-loop).

Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?

My code

In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more ad-hoc code (e.g. the average of 1 deg and 359 deg is 0 deg, not 180 deg).

What my aggregating function does is:

  1. remove NaNs
  2. return NaN if no other value is present
  3. check if a special flag indicating variable wind direction is present. If it is, return the flag
  4. average the wind directions with a vector-averaging algorithm

The function is:

def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas series to be averaged

    Returns
    -------
    float
        averaged wind direction
    '''

    # Removes the NaN from the recording
    x = x.dropna()

    # If the record is empty, return NaN
    if len(x)==0:
        return np.nan

    # If the record contains variable samples (990) return variable (990)
    elif np.any(x == 990):
        return 990

    # Otherwise sum the vectors and return the angle
    else:
        angle = np.rad2deg(
                           np.arctan2(
                                   np.sum(np.sin(np.deg2rad(x))),
                                   np.sum(np.cos(np.deg2rad(x)))
                                     )
                          )

        #Wrap angles from (-pi,pi) to (0,360)
        return (angle + 360) % 360
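
As a quick sanity check (assuming numpy and pandas are imported as in the timing snippet below), the behavior matches the description above:

meandir(pd.Series([1.0, 359.0]))           # ~0 (not 180), up to floating point
meandir(pd.Series([10.0, 990.0, np.nan]))  # 990
meandir(pd.Series([np.nan, np.nan]))       # nan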

you can test it with

from timeit import repeat
import pandas as pd
import numpy as np

N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)

# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990

# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')

# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H'))   # Data from 15:00 to 15:59 fall into the 15:00 bin
aggfuns1 = {'WindSpeed': np.mean, 'WindDir':meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir':np.mean}

res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')

res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')

which on my PC for N_samples=1e4 outputs:

With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms

i.e. the custom aggregating function is about 750 times slower. With N_samples=1e6 it outputs:

With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms

i.e. the custom aggregating function is about 6500 times slower!

Is there a way to speed up this line of code?

Luca
  • Quick question, what are you using for the `repeat` function? Which specific API? – Akshay Sehgal Dec 10 '20 at 20:17
  • @AkshaySehgal `timeit`. The `import` line got cut when I pasted the code. I'll edit immediately to include it – Luca Dec 10 '20 at 20:19
  • what is that special value `990`? – Pierre D Dec 10 '20 at 20:28
  • @PierreD it indicates a variable wind direction. The organization that distributes the data (NOAA) chose it. You can see it more or less as a NaN. But while a NaN indicates a missing sample, 990 indicates a sample that WAS measured, but for which the measurement is "variable" – Luca Dec 10 '20 at 20:33
  • oh, I see: so if present, then you'd like the mean in that bin to be also variable. – Pierre D Dec 10 '20 at 20:39
  • @PierreD exactly – Luca Dec 10 '20 at 20:40
  • what if a bin contains both `NaN` and `990`? – Pierre D Dec 10 '20 at 20:41
  • @PierreD just return 990 (as in my example) – Luca Dec 10 '20 at 23:56
  • Try isolating parts of function to find bottleneck. Is `if/else` slowing you, the nested trig calls, or final `angle` calc? One check is to clean data of groups with all `NaN` and at least one `990` then run only `else` block. Then, try only returning `angle`. – Parfait Dec 11 '20 at 04:59

1 Answer


The key is to try to vectorize everything you can on the whole df, and let groupby use only builtin methods.

Here is a way to do that. The trick is to convert the angles to complex numbers, which numpy will happily sum (and groupby too, but groupby will refuse to mean()). So, we convert the angles to complex, sum, then convert back to angles. The same "funny mean" of angles is used as in your code and described on the Wikipedia page you reference ([Mean of circular quantities](https://en.wikipedia.org/wiki/Mean_of_circular_quantities)).
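
For instance (a small illustration of the trick, not part of the final solution), the circular mean of 350 deg and 10 deg comes out near 0 deg rather than the naive 180 deg:

angles = pd.Series([350.0, 10.0])
z = np.exp(np.deg2rad(angles) * 1j)   # unit vectors in the complex plane
np.angle(z.sum(), deg=True) % 360     # ~0 (up to floating point), not 180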

The handling of the special value (990) can be vectorized too: comparing s.groupby(...).count() with s.replace(val, np.nan).groupby(...).count() finds all the groups that contain at least one such value.
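
A tiny illustration of that counting trick (toy data, not part of the solution below):

s = pd.Series([10.0, 990.0, np.nan, 20.0],
              index=pd.date_range('2000-01-01', periods=4, freq='30min'))
g = pd.Grouper(freq='H')
s.groupby(g).count()                        # hour 00: 2, hour 01: 1
s.replace(990, np.nan).groupby(g).count()   # hour 00: 1, hour 01: 1
# the counts differ for hour 00 -> that group contains at least one 990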

Anyway, here goes:

def to_complex(s):
    # Represent each angle (in degrees) as a unit vector in the complex plane
    return np.exp(np.deg2rad(s) * 1j)

def to_angle(s):
    # Convert complex numbers back to angles in degrees, wrapped to [0, 360)
    return np.angle(s, deg=True) % 360

def mask_val(s, grouper, val=990):
    # True for the groups that contain at least one `val`: replacing `val`
    # with NaN lowers those groups' counts
    return s.groupby(grouper).count() != s.replace(val, np.nan).groupby(grouper).count()

def myagg(df, grouper, val=990, winddir='WindDir'):
    s = df[winddir]
    mask = mask_val(s, grouper, val)
    gb = to_complex(s).groupby(grouper)
    s = gb.sum()      # vector sum per group
    cnt = gb.count()  # number of non-NaN samples per group
    s = to_angle(s) * (cnt / cnt)  # put NaN where all NaNs (0/0 is NaN)
    s[mask] = val     # flag the groups that contained the special value

    # other columns
    agg = df.groupby(grouper).mean()
    agg[winddir] = s

    return agg

Application:

For convenience, I put your example generation into a function gen_example(N_samples).
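
(gen_example is not shown in the answer; a minimal sketch that simply wraps the data generation from the question, with the exact details assumed, would be:)

def gen_example(N_samples):
    # Sketch: wraps the random data generation from the question
    N_samples = int(N_samples)
    N_nan = N_var = int(0.02 * N_samples)
    data = np.random.rand(N_samples, 2) * [30, 360]
    data[np.random.choice(N_samples, N_nan), 1] = np.nan
    data[np.random.choice(N_samples, N_var), 1] = 990
    df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
    df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
    return df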

df = gen_example(50)
myagg(df, pd.Grouper(freq='H'))

Out[ ]:
                     WindSpeed     WindDir
2000-01-01 00:00:00  12.991717  354.120464
2000-01-01 01:00:00  15.743056   60.813629
2000-01-01 02:00:00  14.593927  245.487383
2000-01-01 03:00:00  17.836368  131.493675
2000-01-01 04:00:00  18.987296   27.150359
2000-01-01 05:00:00  16.415725  194.923399
2000-01-01 06:00:00  20.881816  990.000000
2000-01-01 07:00:00  15.033480   44.626018
2000-01-01 08:00:00  16.276834   29.252459

Speed:

df = gen_example(10_000)
%timeit myagg(df, pd.Grouper(freq='H'))

Out[ ]:
6.76 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

df = gen_example(1e6)
%timeit myagg(df, pd.Grouper(freq='H'))

Out[ ]:
189 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing:

idx = [0] * 4
grouper = pd.Grouper(level=0)

myagg(pd.DataFrame({'WindDir': [170, 170, 178, 182]}, index=idx), grouper)
      WindDir
0  174.998473

myagg(pd.DataFrame({'WindDir': [330, 359, 1, 40]}, index=idx), grouper)
    WindDir
0  2.251499

myagg(pd.DataFrame({'WindDir': [330, 359, 1, np.nan]}, index=idx), grouper)
      WindDir
0  350.102878

myagg(pd.DataFrame({'WindDir': [np.nan, np.nan, np.nan, np.nan]}, index=idx), grouper)
   WindDir
0      NaN

myagg(pd.DataFrame({'WindDir': [330, 990, 1, np.nan]}, index=idx), grouper)
   WindDir
0    990.0

Pierre D
  • Why do the angles need to be complex numbers? – jkr Dec 11 '20 at 02:16
  • @jakub: because of the reasons explained quite eloquently on the Wikipedia page cited by the OP: [Mean of circular quantities](https://en.wikipedia.org/wiki/Mean_of_circular_quantities). I missed that initially, and thought, oh well, can just do some appropriate modulo. Not so. Example: what's the angle mean of `(0, 90, 180, 270)` degrees? – Pierre D Dec 11 '20 at 02:21
  • Thank you! This is very useful -- there is a wikipedia page for everything. – jkr Dec 11 '20 at 04:50
  • Thank you very much. So the lesson here is "use pandas builtin (and vectorized) functions as much as you can" and do not use `pandas.GroupBy.aggregate` with a custom aggregating function. I must admit I was hoping to learn how to speed up my custom function, rather than resort to a convoluted use of the builtin ones. However, I accept the answer as it works around the problem in a very clever way and, most importantly, reaches the goal. kudos! – Luca Dec 12 '20 at 13:19