
I'm selecting one max row per group, using groupby/apply to return index values and then selecting the rows with loc.

For example, to group by "Id" and then select the row with the highest "delta" value:

selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]

However, it's very slow this way; my laptop (i7, 16 GB RAM) hangs when I run this query on 13 million rows.

I have two questions for experts:

  1. How can I make this query run fast in pandas? What am I doing wrong?
  2. Why is this operation so expensive?

[Update] Thank you so much for @unutbu's analysis! sort_drop it is! On my i7/32 GB RAM machine, groupby+idxmax hung for nearly 14 hours (it never returned anything), whereas sort_drop handled it in LESS THAN A MINUTE!

I still need to look at how pandas implements each method, but the problem is solved for now! I love Stack Overflow.

– aerin
  • Related: [How do I find: Is the first non-NaN value in each column the maximum for that column in a DataFrame?](https://stackoverflow.com/questions/52349507/how-do-i-find-is-the-first-non-nan-value-in-each-column-the-maximum-for-that-co). Would be nice to also see how those answers' performance compares and scales. – smci Sep 16 '18 at 01:00

2 Answers


The fastest option depends not only on length of the DataFrame (in this case, around 13M rows) but also on the number of groups. Below are perfplots which compare a number of ways of finding the maximum in each group:

If there are only a few (large) groups, using_idxmax may be the fastest option:

[perfplot: runtime vs. len(df), few large groups]

If there are many (small) groups and the DataFrame is not too large, using_sort_drop may be the fastest option:

[perfplot: runtime vs. len(df), many small groups]

Keep in mind, however, that while using_sort_drop, using_sort and using_rank start out looking very fast, as N = len(df) increases their advantage over the other options quickly disappears. For large enough N, using_idxmax becomes the fastest option, even if there are many groups.

using_sort_drop, using_sort and using_rank sort the DataFrame (or the groups within the DataFrame). Sorting is O(N * log(N)) on average, while the other methods use O(N) operations. This is why methods like using_idxmax beat using_sort_drop for very large DataFrames.
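To make the O(N) idea concrete, here is a minimal sketch of a hash-based group maximum using only NumPy and pandas. The linear_group_max helper is my own illustration, not one of the benchmarked kernels, and it returns the per-group maxima rather than the full rows:

import numpy as np
import pandas as pd

def linear_group_max(df):
    # Map each Id to a dense bin number via hashing -- O(N), no sorting.
    bins, uniques = pd.factorize(df['Id'])
    # Single O(N) scatter-max pass over the data.
    maxima = np.full(len(uniques), -np.inf)
    np.maximum.at(maxima, bins, df['delta'].to_numpy(dtype=float))
    return pd.Series(maxima, index=uniques, name='delta')

Note that np.ufunc.at is known to be slow in absolute terms; the point of the sketch is only the structure: one hashing pass plus one scatter pass, with no O(N * log(N)) sort anywhere.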

Be aware that benchmark results may vary for a number of reasons, including machine specs, OS, and software versions. So it is important to run benchmarks on your own machine, and with test data tailored to your situation.

Based on the perfplots above, using_sort_drop may be an option worth considering for your DataFrame of 13M rows, especially if it has many (small) groups. Otherwise, I would suspect using_idxmax to be the fastest option -- but again, it's important that you check benchmarks on your machine.
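If you're not sure which regime your DataFrame is in, a quick look at the number of groups can help guide the choice (a simple diagnostic of my own, not part of the benchmarks above):

n_groups = df['Id'].nunique()
print(f"{len(df):,} rows across {n_groups:,} groups "
      f"(~{len(df) / n_groups:.1f} rows per group)")

Many small groups point toward using_sort_drop; a handful of large groups point toward using_idxmax.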


Here is the setup I used to make the perfplots:

import numpy as np
import pandas as pd 
import perfplot

def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df


def using_idxmax(df):
    return df.loc[df.groupby("Id")['delta'].idxmax()]

def max_mask(s):
    i = np.asarray(s).argmax()
    result = [False]*len(s)
    result[i] = True
    return result

def using_custom_mask(df):
    mask = df.groupby("Id")['delta'].transform(max_mask)
    return df.loc[mask]

def using_isin(df):
    idx = df.groupby("Id")['delta'].idxmax()
    mask = df.index.isin(idx)
    return df.loc[mask]

def using_sort(df):
    df = df.sort_values(by=['delta'], ascending=False, kind='mergesort')
    return df.groupby('Id', as_index=False).first()

def using_rank(df):
    mask = (df.groupby('Id')['delta'].rank(method='first', ascending=False) == 1)
    return df.loc[mask]

def using_sort_drop(df):
    # Thanks to jezrael
    # https://stackoverflow.com/questions/50381064/select-the-max-row-per-group-pandas-performance-issue/50389889?noredirect=1#comment87795818_50389889
    return df.sort_values(by=['delta'], ascending=False, kind='mergesort').drop_duplicates('Id')

def using_apply(df):
    # the OP's original approach from the question
    selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
    return df.loc[selected_idx]

def check(df1, df2):
    df1 = df1.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    df2 = df2.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    return df1.equals(df2)

perfplot.show(
    setup=make_df,
    kernels=[using_idxmax, using_custom_mask, using_isin, using_sort, 
             using_rank, using_apply, using_sort_drop],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(df)',
    repeat=75,
    equality_check=check)
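A note on the kind='mergesort' arguments above: mergesort is stable, so when several rows within a group tie for the maximum delta, the row that survives drop_duplicates or first() is deterministically the earliest one in the original DataFrame (thanks to piRSquared for raising this in the comments below). A toy illustration with made-up data:

import pandas as pd

df_ties = pd.DataFrame({'Id': [1, 1, 2], 'delta': [5, 5, 3]})

# Stable sort: the tied rows of group 1 keep their original order,
# so drop_duplicates('Id') always keeps the earlier row.
stable = (df_ties.sort_values('delta', ascending=False, kind='mergesort')
                 .drop_duplicates('Id'))

# The default quicksort is not stable, so which tied row survives is
# not guaranteed -- still "a" max row, but not reproducible.
unstable = (df_ties.sort_values('delta', ascending=False)
                   .drop_duplicates('Id'))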

Another way to benchmark is to use IPython %timeit:

In [55]:  df = make_df(2**20)

In [56]: %timeit using_sort_drop(df)
1 loop, best of 3: 403 ms per loop

In [57]: %timeit using_rank(df)
1 loop, best of 3: 1.04 s per loop

In [58]: %timeit using_idxmax(df)
1 loop, best of 3: 15.8 s per loop
– unutbu
  • I once realised the fastest is `df = df.sort_values(by=['delta'], ascending=False).drop_duplicates('Id')`; is it possible to add it to the timings? – jezrael May 17 '18 at 11:22
  • Sure. Running the benchmark now; it will take some time. – unutbu May 17 '18 at 11:29
  • @jezrael: Thanks for the suggestion. Your approach (which I called `using_sort_drop` above) is indeed faster for moderately large DataFrames, especially if there are many small groups. But for very large DataFrames, `using_idxmax` will be faster. – unutbu May 17 '18 at 16:36
  • "Moderately large" depends on the number of groups. With very few groups, `using_sort_drop` is advantageous until around len(df) = 100K. But if there are many small groups, it might still be the winner with len(df) around 100M. Thus, it depends a lot on the nature of your DataFrame. It's best to install perfplot and/or IPython and do some benchmarking. – unutbu May 17 '18 at 18:16
  • Phew, you really know how to dish it out! Going to bookmark this, thanks. – cs95 May 17 '18 at 22:11
  • Also good to note, correct me if I'm wrong, but sort stability may come into play. Wouldn't forcing sort to use `mergesort` be better than the default `quicksort`? – piRSquared Jun 26 '18 at 16:47
  • @piRSquared: I've been having trouble installing numba, so unfortunately I'm not able to add your method to the perfplot. But your %timeit result looks impressive. – unutbu Jun 26 '18 at 17:17
  • @piRSquared: Thanks very much for the correction regarding sort stability. I should have been using `mergesort`. (Correcting this now...) – unutbu Jun 26 '18 at 17:18
  • Thanks (-: I like `perfplot` But I couldn't get a plot to appear in notebook. I'll try harder. – piRSquared Jun 26 '18 at 17:20
  • Thanks for this - for me on a 20MM row df, sort / drop is absolutely trouncing the other methods. – trance_dude Feb 10 '21 at 01:52

Using Numba's jit

from numba import njit
import numpy as np
import pandas as pd

@njit
def nidxmax(bins, k, weights):
    # out[b]: positional index of the best row seen so far in bin b
    # trk[b]: its weight, shifted so every value is strictly positive
    #         (so any real weight beats the zero-initialized tracker)
    out = np.zeros(k, np.int64)
    trk = np.zeros(k)
    for i, w in enumerate(weights - (weights.min() - 1)):
        b = bins[i]
        if w > trk[b]:
            trk[b] = w
            out[b] = i
    return np.sort(out)

def with_numba_idxmax(df):
    # factorize maps each Id to a dense bin number in [0, k)
    f, u = pd.factorize(df.Id)
    return df.iloc[nidxmax(f, len(u), df.delta.values)]

Borrowing `make_df` from @unutbu:

def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df

Prime the jit (so compilation time isn't measured below):

with_numba_idxmax(make_df(10));

Test

df = make_df(2**20)


%timeit with_numba_idxmax(df)
%timeit using_sort_drop(df)

47.4 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
194 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
– piRSquared