I need help transforming my data so I can read through transaction data.
Business Case
I'm trying to group related transactions into classes of events. This data set represents workers going out on various leave-of-absence events. For each employee, a transaction should fall into the same leave class as that employee's previous transaction when the two are no more than 365 days apart; a gap of more than 365 days starts a new class. For charting trends, I want to number the classes so I get a sequence/pattern.
My code allows me to see when the very first event occurred, and it can identify when a new class starts, but it doesn't bucket each transaction into a class.
Requirements:
- Tag all rows based on what leave class they fall into.
- Number each Unique Leave Event. In this example, index 0, index 1, and index 2 would each be Unique Leave Event 2, and index 3 would be Unique Leave Event 1, etc.
I added a column for the desired output, labeled "Desired Output". Note that there can be many more rows/events per person, and many more people.
Some Data
import pandas as pd
data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])
Some Code I've Tried
df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]')
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
df['Prior Row in Same Emplid Class'] = "No"
df['Effdt Diff'] = df['Effdt-Shift'] - df['Effective Date']
# Whole-day difference as a number; .dt.days avoids the timedelta64[D] cast, which newer pandas versions reject
df['Effdt Diff'] = df['Effdt Diff'].dt.days
df['Cumul. Count'] = df.groupby('Employee ID').cumcount()
df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')
df['First Row Appears?'] = ""
# Use .loc for conditional assignment; chained indexing (df[col][mask] = ...) may write to a copy
df.loc[df['Cumul. Count'] == df['Groupby'], 'First Row Appears?'] = "First Row"
df.loc[df['Employee ID'] == df['EmplidShift'], 'Prior Row in Same Emplid Class'] = "Yes"
df['Effdt > 1 Yr?'] = ""
df.loc[(df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365), 'Effdt > 1 Yr?'] = "Yes"
df['Unique Leave Event'] = ""
df.loc[(df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row"), 'Unique Leave Event'] = "Unique Leave Event"
df
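To make the bucketing requirement concrete, here is a minimal sketch of one way the numbering could work. It assumes a new class starts whenever the gap to the same employee's previous transaction exceeds 365 days, which is my reading of the "Desired Output" column. The names ordered, gap_days, starts_new_class, and class_number are just illustrative, and the last date is written out as "2014-01-01" purely so the sample parses unambiguously.

```python
import pandas as pd

# Sample rows mirroring the data above. The final date is spelled out as
# "2014-01-01" here, which is an assumption made for this sketch.
df = pd.DataFrame({
    "Employee ID": ["100", "100", "100", "100", "200", "200", "200", "300"],
    "Effective Date": ["2016-01-01", "2015-06-05", "2014-07-01", "2013-01-01",
                       "2016-01-01", "2015-01-01", "2013-01-01", "2014-01-01"],
})
df["Effective Date"] = pd.to_datetime(df["Effective Date"])

# Work in chronological order per employee; the original index is kept so the
# result can be aligned back onto the unsorted rows.
ordered = df.sort_values(["Employee ID", "Effective Date"])

# Days since the same employee's previous transaction (NaN on the first row).
previous_date = ordered.groupby("Employee ID")["Effective Date"].shift()
gap_days = (ordered["Effective Date"] - previous_date).dt.days

# A class starts on an employee's first row or after a gap of more than 365 days.
starts_new_class = gap_days.isna() | (gap_days > 365)

# A running count of class starts per employee gives the class number (1, 2, ...).
class_number = starts_new_class.astype(int).groupby(ordered["Employee ID"]).cumsum()

# Assignment aligns on the index, so each label lands on its original row.
df["Unique Leave Event"] = "Unique Leave Event " + class_number.astype(str)
print(df)
```

On the sample rows this reproduces the "Desired Output" column. If the 365 days should instead be measured from the first transaction of each class, the cumulative-sum step would not be enough and a per-employee loop would probably be needed.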