
I recently had to get the last status that was set for certain items, each labelled with an id. I found this answer: Python : How can I get Rows which have the max value of the group to which they belong?

To my surprise, it was fairly slow on a dataset with only ~2e6 rows. However, I do not need all rows that attain the group maximum, only a single latest row per id.

import time

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.random.randint(1, 1000, size=5000),
    "status": np.random.randint(1, 10, size=5000),
    "date": [
        # random date strings within roughly +/- 1.6 years of now
        time.strftime("%Y-%m-%d", time.localtime(time.time() - x))
        for x in np.random.randint(int(-5e7), int(5e7), size=5000)
    ],
})

%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
1 loops, best of 3: 576 ms per loop

%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
100 loops, best of 3: 4.82 ms per loop

The first one is the solution from the linked answer, and it seems like an approach that allows more complex operations.

However, for my problem I can sort, drop duplicates and reindex, which performs a lot better; especially on larger data sets this really makes a difference.
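For reference, the same sort-and-drop idea can also be spelled directly on the frame with drop_duplicates and its subset/keep arguments (a minimal sketch; the variable name is just for illustration):

# keep the most recent row per id: sort ascending by date,
# then keep the last occurrence of each id
latest = df.sort_values("date").drop_duplicates(subset="id", keep="last")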

My question: are there other ways to achieve what I want to do, possibly with even better performance?

galinden
  • One comment: string sorting is slower than numerical sorting, so you can get a factor of ~3 speedup in your second solution by first converting the date column to datetime type: ``df['date'] = pd.to_datetime(df['date'])`` – jakevdp Nov 04 '15 at 12:03

1 Answer


Another way to approach this is to use an aggregation on the groupby, followed by a selection on the full dataframe.

df.iloc[df.groupby('id')['date'].idxmax()]

This appears to be about a factor of 5-10 faster than the solutions you proposed (see below). Note that this will only work if the 'date' column is numerical rather than string type, and that this transformation also speeds up your sorting-based solution:

# Timing your original solutions:
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 826 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 100 loops, best of 3: 5.1 ms per loop

# convert the date
df['date'] = pd.to_datetime(df['date'])

# new times on your solutions
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
# 1 loops, best of 3: 815 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
# 1000 loops, best of 3: 1.99 ms per loop

# my aggregation solution
%timeit df.iloc[df.groupby('id')['date'].idxmax()]
# 10 loops, best of 3: 135 ms per loop
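One caveat worth adding: idxmax returns index labels rather than positions, so the iloc selection above only lines up because the example frame keeps pandas' default integer index. With a non-default index, a label-based loc sketch would be the safer equivalent:

df.loc[df.groupby('id')['date'].idxmax()]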
jakevdp
  • In my original data the dates were already datetime64, but useful information nonetheless. The aggregation solution gave acceptable performance on my original data (3min40s), which is workable in this case. This will definitely help me in more complex cases where sorting and removing duplicates does not give the desired result. Thanks! Also for the quick response. – galinden Nov 04 '15 at 17:38