
I'm selecting one max row per group, using groupby/apply to return index values and then selecting the rows with loc.

For example, to group by "Id" and then select the row with the highest "delta" value:

selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
selected_rows = df.loc[selected_idx, :]

However, it's very slow this way; my laptop (i7, 16 GB RAM) hangs when I run this query on 13 million rows.

I have two questions for experts:

  1. How can I make this query run fast in pandas? What am I doing wrong?
  2. Why is this operation so expensive?

[Update] Thank you so much for @unutbu's analysis! sort_drop it is! On my i7/32 GB RAM machine, groupby+idxmax hung for nearly 14 hours (it never returned anything), whereas sort_drop handled it in LESS THAN A MINUTE!

I still need to look at how pandas implements each method, but the problem is solved for now! I love Stack Overflow.

– aerin
  • Related: [How do I find: Is the first non-NaN value in each column the maximum for that column in a DataFrame?](https://stackoverflow.com/questions/52349507/how-do-i-find-is-the-first-non-nan-value-in-each-column-the-maximum-for-that-co). Would be nice to also see how those answers' performance compares and scales. – smci Sep 16 '18 at 01:00

2 Answers


The fastest option depends not only on length of the DataFrame (in this case, around 13M rows) but also on the number of groups. Below are perfplots which compare a number of ways of finding the maximum in each group:

If there are only a few (large) groups, using_idxmax may be the fastest option:

[perfplot: runtime vs. len(df), few large groups]

If there are many (small) groups and the DataFrame is not too large, using_sort_drop may be the fastest option:

[perfplot: runtime vs. len(df), many small groups]

Keep in mind, however, that while using_sort_drop, using_sort and using_rank start out looking very fast, as N = len(df) increases their advantage over the other options quickly disappears. For large enough N, using_idxmax becomes the fastest option, even if there are many groups.

using_sort_drop, using_sort and using_rank sort the DataFrame (or the groups within the DataFrame). Sorting is O(N * log(N)) on average, while the other methods use O(N) operations. This is why methods like using_idxmax beat using_sort_drop for very large DataFrames.
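To make the O(N) idea concrete, here is a minimal sketch of a hash-based group maximum using only NumPy and pandas. The linear_group_max helper is my own illustration, not one of the benchmarked kernels, and it returns the per-group maxima rather than the full rows:

import numpy as np
import pandas as pd

def linear_group_max(df):
    # Map each Id to a dense bin number via hashing -- O(N), no sorting.
    bins, uniques = pd.factorize(df['Id'])
    # Single O(N) scatter-max pass over the data.
    maxima = np.full(len(uniques), -np.inf)
    np.maximum.at(maxima, bins, df['delta'].to_numpy(dtype=float))
    return pd.Series(maxima, index=uniques, name='delta')

Note that np.ufunc.at is known to be slow in absolute terms; the point of the sketch is only the structure: one hashing pass plus one scatter pass, with no O(N * log(N)) sort anywhere.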

Be aware that benchmark results may vary for a number of reasons, including machine specs, OS, and software versions. So it is important to run benchmarks on your own machine, and with test data tailored to your situation.

Based on the perfplots above, using_sort_drop may be an option worth considering for your DataFrame of 13M rows, especially if it has many (small) groups. Otherwise, I would suspect using_idxmax to be the fastest option -- but again, it's important that you check benchmarks on your machine.
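If you're not sure which regime your DataFrame is in, a quick look at the number of groups can help guide the choice (a simple diagnostic of my own, not part of the benchmarks above):

n_groups = df['Id'].nunique()
print(f"{len(df):,} rows across {n_groups:,} groups "
      f"(~{len(df) / n_groups:.1f} rows per group)")

Many small groups point toward using_sort_drop; a handful of large groups point toward using_idxmax.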


Here is the setup I used to make the perfplots:

import numpy as np
import pandas as pd 
import perfplot

def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df


def using_idxmax(df):
    return df.loc[df.groupby("Id")['delta'].idxmax()]

def max_mask(s):
    i = np.asarray(s).argmax()
    result = [False]*len(s)
    result[i] = True
    return result

def using_custom_mask(df):
    mask = df.groupby("Id")['delta'].transform(max_mask)
    return df.loc[mask]

def using_isin(df):
    idx = df.groupby("Id")['delta'].idxmax()
    mask = df.index.isin(idx)
    return df.loc[mask]

def using_sort(df):
    df = df.sort_values(by=['delta'], ascending=False, kind='mergesort')
    return df.groupby('Id', as_index=False).first()

def using_rank(df):
    mask = (df.groupby('Id')['delta'].rank(method='first', ascending=False) == 1)
    return df.loc[mask]

def using_sort_drop(df):
    # Thanks to jezrael
    # https://stackoverflow.com/questions/50381064/select-the-max-row-per-group-pandas-performance-issue/50389889?noredirect=1#comment87795818_50389889
    return df.sort_values(by=['delta'], ascending=False, kind='mergesort').drop_duplicates('Id')

def using_apply(df):
    # the OP's original approach from the question
    selected_idx = df.groupby("Id").apply(lambda df: df.delta.argmax())
    return df.loc[selected_idx]

def check(df1, df2):
    df1 = df1.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    df2 = df2.sort_values(by=['Id','delta'], kind='mergesort').reset_index(drop=True)
    return df1.equals(df2)

perfplot.show(
    setup=make_df,
    kernels=[using_idxmax, using_custom_mask, using_isin, using_sort, 
             using_rank, using_apply, using_sort_drop],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(df)',
    repeat=75,
    equality_check=check)
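A note on the kind='mergesort' arguments above: mergesort is stable, so when several rows within a group tie for the maximum delta, the row that survives drop_duplicates or first() is deterministically the earliest one in the original DataFrame (thanks to piRSquared for raising this in the comments below). A toy illustration with made-up data:

import pandas as pd

df_ties = pd.DataFrame({'Id': [1, 1, 2], 'delta': [5, 5, 3]})

# Stable sort: the tied rows of group 1 keep their original order,
# so drop_duplicates('Id') always keeps the earlier row.
stable = (df_ties.sort_values('delta', ascending=False, kind='mergesort')
                 .drop_duplicates('Id'))

# The default quicksort is not stable, so which tied row survives is
# not guaranteed -- still "a" max row, but not reproducible.
unstable = (df_ties.sort_values('delta', ascending=False)
                   .drop_duplicates('Id'))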

Another way to benchmark is to use IPython %timeit:

In [55]:  df = make_df(2**20)

In [56]: %timeit using_sort_drop(df)
1 loop, best of 3: 403 ms per loop

In [57]: %timeit using_rank(df)
1 loop, best of 3: 1.04 s per loop

In [58]: %timeit using_idxmax(df)
1 loop, best of 3: 15.8 s per loop
– unutbu
  • I once realised the fastest is `df = df.sort_values(by=['delta'], ascending=False).drop_duplicates('Id')`; is it possible to add it to the timings? – jezrael May 17 '18 at 11:22
  • Sure. Running the benchmark now; it will take some time. – unutbu May 17 '18 at 11:29
  • @jezrael: Thanks for the suggestion. Your approach (which I called `using_sort_drop` above) is indeed faster for moderately large DataFrames, especially if there are many small groups. But for very large DataFrames, `using_idxmax` will be faster. – unutbu May 17 '18 at 16:36
  • "Moderately large" depends on the number of groups. With very few groups, `using_sort_drop` is advantageous until around len(df) = 100K. But if there are many small groups, it might still be the winner with len(df) around 100M. Thus, it depends a lot on the nature of your DataFrame. It's best to install perfplot and/or IPython and do some benchmarking. – unutbu May 17 '18 at 18:16
  • Phew, you really know how to dish it out! Going to bookmark this, thanks. – cs95 May 17 '18 at 22:11
  • Also good to note, correct me if I'm wrong, but sort stability may come into play. Wouldn't forcing sort to use `mergesort` be better than the default `quicksort`? – piRSquared Jun 26 '18 at 16:47
  • @piRSquared: I've been having trouble installing numba, so unfortunately I'm not able to add your method to the perfplot. But your %timeit result looks impressive. – unutbu Jun 26 '18 at 17:17
  • @piRSquared: Thanks very much for the correction regarding sort stability. I should have been using `mergesort`. (Correcting this now...) – unutbu Jun 26 '18 at 17:18
  • Thanks (-: I like `perfplot` But I couldn't get a plot to appear in notebook. I'll try harder. – piRSquared Jun 26 '18 at 17:20
  • Thanks for this - for me on a 20MM row df, sort / drop is absolutely trouncing the other methods. – trance_dude Feb 10 '21 at 01:52

Using Numba's jit

from numba import njit
import numpy as np
import pandas as pd

@njit
def nidxmax(bins, k, weights):
    # out[b]: positional index of the best row seen so far in bin b
    # trk[b]: its weight, shifted so every value is strictly positive
    #         (so any real weight beats the zero-initialized tracker)
    out = np.zeros(k, np.int64)
    trk = np.zeros(k)
    for i, w in enumerate(weights - (weights.min() - 1)):
        b = bins[i]
        if w > trk[b]:
            trk[b] = w
            out[b] = i
    return np.sort(out)

def with_numba_idxmax(df):
    # factorize maps each Id to a dense bin number in [0, k)
    f, u = pd.factorize(df.Id)
    return df.iloc[nidxmax(f, len(u), df.delta.values)]

Borrowing `make_df` from @unutbu:

def make_df(N):
    # lots of small groups
    df = pd.DataFrame(np.random.randint(N//10+1, size=(N, 2)), columns=['Id','delta'])
    # few large groups
    # df = pd.DataFrame(np.random.randint(10, size=(N, 2)), columns=['Id','delta'])
    return df

Prime the jit (so compilation time isn't measured below):

with_numba_idxmax(make_df(10));

Test

df = make_df(2**20)


%timeit with_numba_idxmax(df)
%timeit using_sort_drop(df)

47.4 ms ± 99.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
194 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
– piRSquared