0

I have two dataframes. Dataframe X has columns:

Index(['studentid', 'Year', 'MCR_NAME', 'QUAL_DETAILS'])

Dataframe Y has columns:

Index(['studentid', 'total'])

I want to merge X and Y on 'studentid' using:

Z = X.merge(Y, on="studentid")

However, the same "studentid" field can occur in different years in X. If this happens, I want to only keep the earliest record after the merge. That is I don't want the same studentid to occur twice in the merged dataframe. The years from the Year field are written "20/21", "21/22", "22/23".

A given studentid can only occur once per year.

How can I do that?

Simd
  • 19,447
  • 42
  • 136
  • 271
  • 1
    It'd help to provide a [mre]. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). If I'm following correctly, the merge is not crucial to the problem, you'd actually just need to sort `X` by year, group by `studentid`, then select `first`, then you can proceed with the merge. But I'm not sure if the `studentid` can occur multiple times in a year for each `MCR_NAME` and `QUAL_DETAILS`, which might complicate things. – wjandrea Dec 12 '22 at 21:03
  • Please add a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) – Rabinzel Dec 12 '22 at 21:04
  • @Rabinzel FYI you can write `[mre]` in a comment, and it'll turn into *[mre]* :) There are more shorthands too; check out [comment formatting](/editing-help#comment-formatting). – wjandrea Dec 12 '22 at 21:07
  • @wjandrea A given studentid can only occur once per year – Simd Dec 12 '22 at 21:18

1 Answers1

1

I would drop duplicates before the merge:

Z = Y.merge(X.sort('year').drop_duplicates('studentid'), on="studentid")
dzang
  • 2,160
  • 2
  • 12
  • 21