I have a large dataframe containing information on people and their job change history. Sometimes, someone had multiple changes to their record on one day, each of which is assigned a transaction sequence number. For each employee and day, I want to keep only the row with the highest transaction sequence number. Currently, I'm using the for loop below to do this, but it takes forever.
list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(df) > 1:
        list_indexes_to_drop += list(df.index[df["Transaction Sequence Number"] != df["Transaction Sequence Number"].max()])
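For reference, here is that loop as a self-contained, runnable sketch on a tiny made-up frame, assuming the collected indexes are afterwards removed with drop (the drop step isn't shown above):

```python
import pandas as pd

# Tiny made-up sample in the same shape as df_job_his
df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 2],
    "Event Date": ["2020-06-08", "2020-06-08", "2022-09-01"],
    "Transaction Sequence Number": [1, 2, 1],
})

# Current approach: collect the indexes of non-max rows per group, then drop them
list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(df) > 1:
        list_indexes_to_drop += list(
            df.index[df["Transaction Sequence Number"]
                     != df["Transaction Sequence Number"].max()]
        )

df_deduped = df_job_his.drop(list_indexes_to_drop)
# drops index 0 (sequence 1), keeping indexes 1 and 2
```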
I also have this code below, but I'm not sure how to use it to filter the dataframe.
df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()
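For context, that expression yields a Series indexed by the (Employee ID, Event Date) MultiIndex rather than by the original row index, which is why it can't be used directly as a boolean mask on the frame. A minimal illustration on made-up data:

```python
import pandas as pd

df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 2],
    "Event Date": ["2020-06-08", "2020-06-08", "2022-09-01"],
    "Transaction Sequence Number": [1, 2, 1],
})

# One row per (Employee ID, Event Date) group, holding that group's max sequence number
max_seq = df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()
print(max_seq)
# Employee ID  Event Date
# 1            2020-06-08    2
# 2            2022-09-01    1
# Name: Transaction Sequence Number, dtype: int64
```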
Is there a more efficient way to go about this?
Here's an example of some random data in the same format:
df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10],
    "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01", "2022-02-15", "2022-02-15", "2021-07-29", "2021-07-29", "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04", "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13", "2020-03-17"],
    "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1],
})
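For reference, the expected result on this sample keeps the highest-sequence row per (Employee ID, Event Date) pair: 12 of the 17 rows survive, and row indexes 1, 4, 6, 10, and 14 are dropped. A self-contained check using the loop above (with the sample built as a plain DataFrame, no trailing groupby):

```python
import pandas as pd

df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10],
    "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01",
                   "2022-02-15", "2022-02-15", "2021-07-29", "2021-07-29",
                   "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04",
                   "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13",
                   "2020-03-17"],
    "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1],
})

# Same logic as the loop in the question: flag every non-max row within a group
to_drop = []
for _, group in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(group) > 1:
        to_drop += list(group.index[group["Transaction Sequence Number"]
                                    != group["Transaction Sequence Number"].max()])

expected = df_job_his.drop(to_drop)
print(sorted(to_drop))  # [1, 4, 6, 10, 14]
print(len(expected))    # 12
```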