import time
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'gr': np.random.choice(7000, 500000),
                   'col': np.random.choice(1000, 500000)})
groups = df.groupby('gr')
t1 = time.time()
idx = groups.col.idxmax()
print(round(time.time() - t1, 1))
0.7

Is there a way to get these indices significantly faster than with idxmax()?

Note: I am only interested in idx.values; I don't mind losing the index of the idx Series.

Tony

2 Answers


On my side, using drop_duplicates is faster than groupby idxmax, around 8 times faster:

%timeit df.sort_values(['gr','col']).drop_duplicates('gr',keep='last').index
10 loops, best of 3: 67.3 ms per loop
%timeit df.groupby('gr').col.idxmax()
1 loop, best of 3: 491 ms per loop
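
Since the question only needs the index values, here is a minimal follow-up sketch (assuming the df from the question): because the frame is sorted by gr first, the surviving rows are already in group order, so .index.values gives one original index per group, ordered by group id. Note the tie-handling caveat raised in the comments below.

# One row per group remains after drop_duplicates; rows are in ascending
# 'gr' order because of the sort, so this is one index per group.
max_idx = (df.sort_values(['gr', 'col'])
             .drop_duplicates('gr', keep='last')
             .index
             .values)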
BENY
  • @jezrael yes, you are right, it is a dup :-). Thank you for finding it out; I also learned from the accepted answer. – BENY Jun 26 '18 at 15:12
  • Thank you for your answer. Indeed, your approach is faster because my example has many small groups. However, your solution does not return the same result as my code. Could you please check? – Tony Jun 26 '18 at 15:14
  • @jezrael want me to remove it ? – BENY Jun 26 '18 at 15:15
  • @jezrael I tried; after I converted it to community wiki, he accepted it, so I cannot remove it now ... – BENY Jun 26 '18 at 15:17
  • In order to get the correct results, you need to respect the stability of the sort. With a stable sort, the order will not be changed for those things with the same sorting value. This gets messed up if you sort ascending then take last. So instead, you want to sort descending **and** use a stable sorting algorithm like `mergesort`. Use `df.sort_values(['gr', 'col'], ascending=False, kind='mergesort').drop_duplicates('gr').index` – piRSquared Jun 26 '18 at 15:48
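
A minimal check of the stable-sort variant from the comment above (assuming the df from the question): with kind='mergesort' and the default keep='first', ties within a group resolve to the earliest original index, which is what groupby(...).idxmax() returns, so the two index sets should coincide.

idx_groupby = df.groupby('gr').col.idxmax()
idx_stable = (df.sort_values(['gr', 'col'], ascending=False, kind='mergesort')
                .drop_duplicates('gr')
                .index)
# Compare as sorted arrays; this should print True.
print(np.array_equal(np.sort(idx_groupby.values), np.sort(idx_stable.values)))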

Numba just-in-time compilation

from numba import njit

@njit
def idxmax_(bins, k, weights):
    # Single pass: for each of the k groups, track the running maximum
    # weight and the position at which it occurs.
    out = np.zeros(k, np.int64)
    trk = np.zeros(k)
    # Shift weights so every value is strictly positive and beats the
    # zero-initialized tracker on its first encounter.
    for i, w in enumerate(weights - (weights.min() - 1)):
        b = bins[i]
        if w > trk[b]:
            trk[b] = w
            out[b] = i
    return np.sort(out)

def idxmax(df):
    # Factorize the group labels into integer bins 0..k-1.
    f, u = pd.factorize(df.gr)
    return idxmax_(f, len(u), df.col.values)

idxmax(df)

array([   156,    220,    258, ..., 499945, 499967, 499982])

Make sure to prime the function in order to compile it

idxmax(df.head())

Then time it

%timeit idxmax(df)
%timeit df.sort_values(['gr', 'col'], ascending=False).drop_duplicates('gr').index

6.07 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
152 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Comparing equality

idx0 = df.groupby('gr').col.idxmax().sort_values().values
idx1 = idxmax(df)
idx2 = df.sort_values(
    ['gr', 'col'],
    ascending=False
).drop_duplicates('gr').index.sort_values().values

print((idx0 == idx1).all(), (idx0 == idx2).all(), sep='\n')

True
True
piRSquared
  • Thank you very much for your answer. A couple of points. For some reason `idxmax()` does not return the same result as `groups.col.idxmax()`. Further, the `drop_duplicates` approach you are timing also does not return the same result as the `idxmax()`. It needs `ascending=True` in `sort_values` and `keep='last'` in `drop_duplicates`. Finally, even the 2 approaches you are timing do not seem to return the same results, at least in my version (`python 3.5.0`, `pandas 0.18.1`). – Tony Jun 26 '18 at 16:54
  • You may have not sorted. I've updated my post verifying equality. – piRSquared Jun 26 '18 at 17:02
  • Oh I see, ok you are 100% right. But if I might ask, why would one need to sort by index number? Wouldn't it make more sense to have the indices sorted by group id (`gr`), which is how the result is returned without the `.sort_values()` bit in the `idxmax()`? Would it be too difficult for me to adapt your solution accordingly? Thanks very much – Tony Jun 26 '18 at 17:18
  • You can adapt my solution by returning `out` instead of `np.sort(out)`. – piRSquared Jun 26 '18 at 17:22
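
A minimal sketch of that adaptation (the names idxmax_by_group_ / idxmax_by_group are illustrative, not from the answer): return out instead of np.sort(out), and, as an extra assumption beyond the comment, pass sort=True to pd.factorize so the bins follow sorted gr order and the output lines up with groupby('gr').col.idxmax().values.

from numba import njit
import numpy as np
import pandas as pd

@njit
def idxmax_by_group_(bins, k, weights):
    # Same single-pass scan as in the answer, but the positions are
    # returned in group (bin) order instead of sorted by index value.
    out = np.zeros(k, np.int64)
    trk = np.zeros(k)
    for i, w in enumerate(weights - (weights.min() - 1)):
        b = bins[i]
        if w > trk[b]:
            trk[b] = w
            out[b] = i
    return out  # one index per group, in bin order

def idxmax_by_group(df):
    # sort=True makes bin 0 correspond to the smallest 'gr' label,
    # so the result is ordered by group id, like the groupby output.
    f, u = pd.factorize(df.gr, sort=True)
    return idxmax_by_group_(f, len(u), df.col.values)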