0

Hi I have a Dataframe with 3 columns as below. I would like to calculate the earliest "Start Date" for each Employee_ID using Python Pandas. e.g. for Employee_ID SNC1289037 , Start_Date will be 1/1/2020 and End_Date will be 12/31/2999, as there are no breaks in the time periods. for Employee_ID SNC2289038, Start_Date will be 3/1/2020 and End_Date will be 12/31/2999, as there is a break in the time period.

Employee_ID|Start_date | End_date
---------------------------------
SNC1289037  1/1/2020    1/31/2020
SNC1289037  2/1/2020    2/29/2020
SNC1289037  3/1/2020    3/30/2020 
SNC1289037  4/1/2020    12/31/2999 
SNC2289038  1/1/2020    1/31/2020
SNC2289038  3/1/2020    3/30/2020
SNC2289038  4/1/2020    12/31/2999
SNC4589038  1/1/2020    1/31/2020
SNC4589038  2/1/2020    2/29/2020
SNC4589038  3/1/2020    3/30/2020

Thank you very much for your help.

saloua
  • 2,433
  • 4
  • 27
  • 37
Pandas
  • 1
  • 2
  • Please don't include images of things that don't absolutely need to be images. We put in effort answering your questions, please put in the effort to properly provide the example data. – orlp Nov 09 '20 at 00:34
  • @Pandas--do you need dates like 12/31/2999 in your data? Dates this large causes errors when converting to Date Object since dates in Pandas are limited to 64-bit timestamps. Meaning would a date such as 12/31/2199 be sufficient? – DarrylG Nov 09 '20 at 01:30

1 Answers1

0

Dates such as 12/31/2999 cause issues with Pandas Datetime functions i.e. reference.

Assumption: okay to limit dates to say 12/31/2199 (rather than 12/31/2999)

Code

# Data
s = '''Employee_ID Start_date End_date
SNC1289037 1/1/2020 1/31/2020
SNC1289037 2/1/2020 2/29/2020
SNC1289037 3/1/2020 3/30/2020
SNC1289037 4/1/2020 12/31/2199
SNC2289038 1/1/2020 1/31/2020
SNC2289038 3/1/2020 3/30/2020
SNC2289038 4/1/2020 12/31/2199
SNC4589038 1/1/2020 1/31/2020
SNC4589038 2/1/2020 2/29/2020
SNC4589038 3/1/2020 3/30/2020'''

# Data to Pandas DataFrame
from io import StringIO
df = pd.read_csv(StringIO(s), delim_whitespace=True)

# Convert date columns to dates
df['Start_date']= pd.to_datetime(df['Start_date'])
df['End_date']= pd.to_datetime(df['End_date'])

# Show Earliest Start date, and last End date by Group (using aggregate function)
df.groupby('Employee_ID').agg({'Start_date' : "min", 'End_date': "max"})

Output

            Start_date  End_date
Employee_ID     
SNC1289037  2020-01-01  2199-12-31
SNC2289038  2020-01-01  2199-12-31
SNC4589038  2020-01-01  2020-03-30
DarrylG
  • 16,732
  • 2
  • 17
  • 23