0

I have the following DataFrame containing employment history: (1900-01-01 is used to store a null end date)

id company job_title start_date end_date is_current_employer
A Google Manager 2023-05-06 1900-01-01 1
A Amazon SWE 2021-01-19 2023-02-02 0
A Meta SWE 2017-11-13 2020-10-24 0
B Tesla Research 2020-06-06 2023-07-14 0
B Microsoft Data Eng. 2017-04-17 2019-12-05 0
C Adobe Intern 2022-10-12 1900-01-01 1
C TikTok Intern 2023-03-23 1900-01-01 1

I'm hoping to return a DataFrame with one record for each ID, with that ID's most recent employment, like so:

id company job_title start_date end_date is_current_employer
A Google Manager 2023-05-06 1900-01-01 1
B Tesla Research 2020-06-06 2023-07-14 0
C Adobe Intern 2022-10-12 1900-01-01 1

If a candidate (ID) has only one current employer, that record should be returned. If a candidate has no current employers, the most recent employer should be returned (max start date). If a candidate has multiple current employers, the record with the earlier start date should be returned.

How can I concisely do this in Python?

  • 1
    Welcome to Stack Overflow! Check out the [tour] and [ask], which has tips like how to write a good title. A good title helps attract people and helps them more quickly understand what you're asking. – wjandrea Aug 10 '23 at 23:27
  • *"1900-01-01 is used to store a null end date"* -- Beside the point, but is it possible to avoid doing that, and use `NaT` instead? – wjandrea Aug 10 '23 at 23:28
  • You can get the result you want here by just getting the max start date per group, but I think the edge case you're concerned about isn't actually represented here: when a candidate started and ended at one employer while working for another one. – wjandrea Aug 10 '23 at 23:32
  • *"If a candidate has multiple current employers, the record with the earlier start date should be returned."* -- You mean the *later* start date, right? That's what you're showing for C. – wjandrea Aug 10 '23 at 23:40
  • Yes, edited my original post--I'm looking for the earlier start date, since I want the job they've been at longer. – gracearonsohn Aug 11 '23 at 00:02

1 Answers1

1

You can use group by and first() to do this. First, you must sort by start date to get the most recent employment:

df.sort_values(["start_date"], ascending=False).groupby("id").first()

Output:

id company job_title start_date end_date is_current_employer
A Google Manager 2023-05-06 1900-01-01 1
B Tesla Research 2020-06-06 2023-07-14 0
C TikTok Intern 2023-03-23 1900-01-01 1
Suraj Shourie
  • 536
  • 2
  • 11