I recently had to get the last set status for certain items, labeled with ids. I found this answer: Python : How can I get Rows which have the max value of the group to which they belong?
To my surprise on a dataset with only ~2e6 rows it was fairly slow. However I do not need to get all max values, only the last one.
import numpy as np
import pandas as pd
df = pd.DataFrame({
"id": np.random.randint(1, 1000, size=5000),
"status": np.random.randint(1, 10, size=5000),
"date": [
time.strftime("%Y-%m-%d", time.localtime(time.time() - x))
for x in np.random.randint(-5e7, 5e7, size=5000)
],
})
%timeit df.groupby('id').apply(lambda t: t[t.date==t.date.max()])
1 loops, best of 3: 576 ms per loop
%timeit df.reindex(df.sort_values(["date"], ascending=False)["id"].drop_duplicates().index)
100 loops, best of 3: 4.82 ms per loop
The first one was the solution I found in the link, which seems a way which allows more complex operations.
However for my issue I could sort and drop duplicates and reindex, which performs a lot better. Especially on larger data sets this really makes a difference.
My Questions: are there other ways to achieve what I want to do? Possibly with even better performance?