0

My df looks like this:

sid, class_start,   class_end,   graduated, grad_date,  college_name,  degree,  major
123  2010-01-01     2010-04-15   NA         NA          ABC            NA       NA    
123  2010-06-01     2010-09-15   NA         NA          ABC            NA       NA    
123  2010-10-01     2010-12-15   NA         NA          ABC            NA       NA    
123  NA             NA           Y          2010-12-15  ABC            BS     Biology 
123  2011-01-01     2011-04-15   NA         NA          WRT            NA       NA    
123  2011-06-01     2011-09-15   NA         NA          WRT            NA       NA    
123  2011-10-01     2011-12-15   NA         NA          WRT            NA       NA    
123  NA             NA           Y          2011-12-15  ABC            BS     Chem    
123  2012-01-01     2011-04-15   NA         NA          ABC            NA       NA    
123  2012-06-01     2011-09-15   NA         NA          ABC            NA       NA    
123  2012-10-01     2011-12-15   NA         NA          ABC            NA       NA    
123  NA             NA           Y          2012-12-15  ABC            MS     Biology 
765  2010-01-01     2010-04-15   NA         NA          EGF            NA       NA    
765  2010-06-01     2010-09-15   NA         NA          EGF            NA       NA    
765  2010-10-01     2010-12-15   NA         NA          EGF            NA       NA    
765  NA             NA           N          NA          EGF            BS     Physics 
765  2014-01-01     2014-04-15   NA         NA          RSE            NA       NA    
765  2014-06-01     2014-09-15   NA         NA          RSE            NA       NA    
765  2014-10-01     2014-12-15   NA         NA          RSE            NA       NA    
765  NA             NA           N          NA          RSE            BS     Physics 

what I am trying to achieve is for each sid I want to get the first graduated value from from the first college. If a student does not graduate then we get the value of the last college.

Also, note that that Desired Output has a new columns first_start and last_end

  • first_start = students first enrollment date
  • last_end = student last enrollment date

For student graduated == Y, we only take this from college they first graduated.

For student graduated == N, first_start is the first college start date, last_end is the date last_end of last college

Desired Output

sid, first_start,   last_end,   graduated, grad_date,  college_name,  degree,  major
123  2010-01-01     2010-12-15   Y         2010-12-15  ABC            BS     Biology    
765  2010-01-01     2014-12-15   N          NA         RSE            BS     Physics 

My Approach:

  • the closest I got is when I group by ['sid', 'college_name'] and then aggregrate value for other columns such as min value for first_start and max for last_end and so on.. that does not get me to the Desired Output

Could someone please help achieve this result?

floss
  • 2,603
  • 2
  • 20
  • 37
  • Does it help ? [Looks like the same issue](https://stackoverflow.com/questions/68049365/select-first-value-of-specific-column-for-each-id-in-sorted-pandas-data-frame) – Pompedup Nov 19 '22 at 21:50
  • thanks for suggesting but it does not take care of all the required output – floss Nov 19 '22 at 23:57

1 Answers1

0

Does it works?

df_graduated = df[df.graduated == "Y"].sort_values(by=['grad_date']).groupby("sid", as_index=False).first()
graduated_ids = df_graduated["sid"]
df_not_graduated = df[~df.sid.isin(graduated_ids)].groupby("sid", as_index=False).last()
df_start_dates = df[df.class_start != "NA"].sort_values(by=['class_start']).groupby("sid").agg(first_start=('class_start', np.min))
df_end_dates = df[df.class_start != "NA"].sort_values(by=['class_end']).groupby("sid").agg(last_end=('class_end', np.max))
df_concatenated = pd.concat([df_graduated, df_not_graduated])
df_dates = pd.merge(df_start_dates, df_end_dates, how='inner', on = 'sid')
print(pd.merge(df_dates, df_concatenated, how='inner', on = 'sid')[["sid", "first_start", "last_end", "graduated", "grad_date", "college_name", "degree", "major"]])

There is a different on the last_end for 123 but it's the last enrollment date is I'm right so it's ok

   sid first_start    last_end graduated   grad_date college_name degree    major
0  123  2010-01-01  2011-12-15         Y  2010-12-15          ABC     BS  Biology
1  765  2010-01-01  2014-12-15         N          NA          RSE     BS  Physics
Pompedup
  • 566
  • 2
  • 8