Pandas - Conditional grouping

Question

I have the following DataFrame containing employment history: (1900-01-01 is used to store a null end date)

id	company	job_title	start_date	end_date	is_current_employer
A	Google	Manager	2023-05-06	1900-01-01	1
A	Amazon	SWE	2021-01-19	2023-02-02	0
A	Meta	SWE	2017-11-13	2020-10-24	0
B	Tesla	Research	2020-06-06	2023-07-14	0
B	Microsoft	Data Eng.	2017-04-17	2019-12-05	0
C	Adobe	Intern	2022-10-12	1900-01-01	1
C	TikTok	Intern	2023-03-23	1900-01-01	1

I'm hoping to return a DataFrame with one record for each ID, with that ID's most recent employment, like so:

id	company	job_title	start_date	end_date	is_current_employer
A	Google	Manager	2023-05-06	1900-01-01	1
B	Tesla	Research	2020-06-06	2023-07-14	0
C	Adobe	Intern	2022-10-12	1900-01-01	1

If a candidate (ID) has only one current employer, that record should be returned. If a candidate has no current employers, the most recent employer should be returned (max start date). If a candidate has multiple current employers, the record with the earlier start date should be returned.

How can I concisely do this in Python?

Welcome to Stack Overflow! Check out the [tour] and [ask], which has tips like how to write a good title. A good title helps attract people and helps them more quickly understand what you're asking. — wjandrea, Aug 10 '23 at 23:27
*"1900-01-01 is used to store a null end date"* -- Beside the point, but is it possible to avoid doing that, and use `NaT` instead? — wjandrea, Aug 10 '23 at 23:28
You can get the result you want here by just getting the max start date per group, but I think the edge case you're concerned about isn't actually represented here: when a candidate started and ended at one employer while working for another one. — wjandrea, Aug 10 '23 at 23:32
*"If a candidate has multiple current employers, the record with the earlier start date should be returned."* -- You mean the *later* start date, right? That's what you're showing for C. — wjandrea, Aug 10 '23 at 23:40
Yes, edited my original post--I'm looking for the earlier start date, since I want the job they've been at longer. — gracearonsohn, Aug 11 '23 at 00:02

Suraj Shourie · Accepted Answer · 2023-08-10T23:41:59.790

1

You can use group by and first() to do this. First, you must sort by start date to get the most recent employment:

df.sort_values(["start_date"], ascending=False).groupby("id").first()

Output:

id	company	job_title	start_date	end_date	is_current_employer
A	Google	Manager	2023-05-06	1900-01-01	1
B	Tesla	Research	2020-06-06	2023-07-14	0
C	TikTok	Intern	2023-03-23	1900-01-01	1

edited Aug 10 '23 at 23:41

answered Aug 10 '23 at 23:29

Suraj Shourie

536
2
11

`C Adobe` vs `C TikTok`? – Vitalizzare Aug 10 '23 at 23:51
@Vitalizzare OP edited the question – wjandrea Aug 10 '23 at 23:52
This worked! I mistyped my post originally (when there are two current employers, I want the earlier start date, not later) but this solution still works perfectly after sorting by end date. – gracearonsohn Aug 11 '23 at 00:05

Pandas - Conditional grouping

1 Answers1