I have a large dataframe containing information on people and their job change history. Sometimes, someone had multiple changes to their record on one day, each of which is assigned a transaction sequence number. For each employee and day, I want to keep only the row with the highest transaction sequence number. Currently, I'm using the for loop below to do this, but it takes forever.
list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(df) > 1:
        list_indexes_to_drop += list(df.index[df["Transaction Sequence Number"] != df["Transaction Sequence Number"].max()])
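For reference, here is that loop as a self-contained, runnable sketch on a tiny made-up frame, assuming the collected indexes are afterwards removed with drop (the drop step isn't shown above):

```python
import pandas as pd

# Tiny made-up sample in the same shape as df_job_his
df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 2],
    "Event Date": ["2020-06-08", "2020-06-08", "2022-09-01"],
    "Transaction Sequence Number": [1, 2, 1],
})

# Current approach: collect the indexes of non-max rows per group, then drop them
list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(df) > 1:
        list_indexes_to_drop += list(
            df.index[df["Transaction Sequence Number"]
                     != df["Transaction Sequence Number"].max()]
        )

df_deduped = df_job_his.drop(list_indexes_to_drop)
# drops index 0 (sequence 1), keeping indexes 1 and 2
```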
I also have this code below, but I'm not sure how to use it to filter the dataframe.
df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()
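For context, that expression yields a Series indexed by the (Employee ID, Event Date) MultiIndex rather than by the original row index, which is why it can't be used directly as a boolean mask on the frame. A minimal illustration on made-up data:

```python
import pandas as pd

df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 2],
    "Event Date": ["2020-06-08", "2020-06-08", "2022-09-01"],
    "Transaction Sequence Number": [1, 2, 1],
})

# One row per (Employee ID, Event Date) group, holding that group's max sequence number
max_seq = df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()
print(max_seq)
# Employee ID  Event Date
# 1            2020-06-08    2
# 2            2022-09-01    1
# Name: Transaction Sequence Number, dtype: int64
```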
Is there a more efficient way to go about this?
Here's an example of some random data in the same format:
df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10],
    "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01", "2022-02-15", "2022-02-15", "2021-07-29", "2021-07-29", "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04", "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13", "2020-03-17"],
    "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1],
})
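For reference, the expected result on this sample keeps the highest-sequence row per (Employee ID, Event Date) pair: 12 of the 17 rows survive, and row indexes 1, 4, 6, 10, and 14 are dropped. A self-contained check using the loop above (with the sample built as a plain DataFrame, no trailing groupby):

```python
import pandas as pd

df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10],
    "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01",
                   "2022-02-15", "2022-02-15", "2021-07-29", "2021-07-29",
                   "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04",
                   "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13",
                   "2020-03-17"],
    "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1],
})

# Same logic as the loop in the question: flag every non-max row within a group
to_drop = []
for _, group in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(group) > 1:
        to_drop += list(group.index[group["Transaction Sequence Number"]
                                    != group["Transaction Sequence Number"].max()])

expected = df_job_his.drop(to_drop)
print(sorted(to_drop))  # [1, 4, 6, 10, 14]
print(len(expected))    # 12
```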