How to only keep the earliest record in a pandas marge

Question

I have two dataframes. Dataframe X has columns:

Index(['studentid', 'Year', 'MCR_NAME', 'QUAL_DETAILS'])

Dataframe Y has columns:

Index(['studentid', 'total'])

I want to merge X and Y on 'studentid' using:

Z = X.merge(Y, on="studentid")

However, the same "studentid" field can occur in different years in X. If this happens, I want to only keep the earliest record after the merge. That is I don't want the same studentid to occur twice in the merged dataframe. The years from the Year field are written "20/21", "21/22", "22/23".

A given studentid can only occur once per year.

How can I do that?

It'd help to provide a [mre]. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). If I'm following correctly, the merge is not crucial to the problem, you'd actually just need to sort `X` by year, group by `studentid`, then select `first`, then you can proceed with the merge. But I'm not sure if the `studentid` can occur multiple times in a year for each `MCR_NAME` and `QUAL_DETAILS`, which might complicate things. — wjandrea, Dec 12 '22 at 21:03
Please add a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) — Rabinzel, Dec 12 '22 at 21:04
@Rabinzel FYI you can write `[mre]` in a comment, and it'll turn into *[mre]* :) There are more shorthands too; check out [comment formatting](/editing-help#comment-formatting). — wjandrea, Dec 12 '22 at 21:07

dzang · Answer 1 · 2022-12-13T10:39:16.743

1

I would drop duplicates before the merge:

Z = Y.merge(X.sort('year').drop_duplicates('studentid'), on="studentid")

edited Dec 13 '22 at 10:39

answered Dec 12 '22 at 21:03

dzang

2,160
2
12
21

How to only keep the earliest record in a pandas marge

1 Answers1