My df
looks like this:
sid, class_start, class_end, graduated, grad_date, college_name, degree, major
123 2010-01-01 2010-04-15 NA NA ABC NA NA
123 2010-06-01 2010-09-15 NA NA ABC NA NA
123 2010-10-01 2010-12-15 NA NA ABC NA NA
123 NA NA Y 2010-12-15 ABC BS Biology
123 2011-01-01 2011-04-15 NA NA WRT NA NA
123 2011-06-01 2011-09-15 NA NA WRT NA NA
123 2011-10-01 2011-12-15 NA NA WRT NA NA
123 NA NA Y 2011-12-15 ABC BS Chem
123 2012-01-01 2011-04-15 NA NA ABC NA NA
123 2012-06-01 2011-09-15 NA NA ABC NA NA
123 2012-10-01 2011-12-15 NA NA ABC NA NA
123 NA NA Y 2012-12-15 ABC MS Biology
765 2010-01-01 2010-04-15 NA NA EGF NA NA
765 2010-06-01 2010-09-15 NA NA EGF NA NA
765 2010-10-01 2010-12-15 NA NA EGF NA NA
765 NA NA N NA EGF BS Physics
765 2014-01-01 2014-04-15 NA NA RSE NA NA
765 2014-06-01 2014-09-15 NA NA RSE NA NA
765 2014-10-01 2014-12-15 NA NA RSE NA NA
765 NA NA N NA RSE BS Physics
what I am trying to achieve is for each sid
I want to get the first graduated
value from from the first college
. If a student does not graduate then we get the value of the last college.
Also, note that that Desired Output has a new columns first_start
and last_end
first_start
= students first enrollment datelast_end
= student last enrollment date
For student graduated == Y
, we only take this from college they first graduated.
For student graduated == N
, first_start
is the first college start date, last_end
is the date last_end
of last college
Desired Output
sid, first_start, last_end, graduated, grad_date, college_name, degree, major
123 2010-01-01 2010-12-15 Y 2010-12-15 ABC BS Biology
765 2010-01-01 2014-12-15 N NA RSE BS Physics
My Approach:
- the closest I got is when I group by ['sid', 'college_name'] and then aggregrate value for other columns such as
min
value forfirst_start
andmax
forlast_end
and so on.. that does not get me to the Desired Output
Could someone please help achieve this result?