Create groups/classes based on conditions within columns

Question

I need help transforming my data so I can read through transaction data.

Business Case

I'm trying to group together some related transactions to create some groups or classes of events. This data set represents workers going out on various leaves of absence events. I want to create one class of leaves based on any transaction falling within 365 days of the leave event class. For charting trends, I want to number the classes so I get a sequence/pattern.

My code allows me to see when the very first event occurred, and it can identify when a new class starts, but it doesn't bucket each transaction into a class.

Requirements:

Tag all rows based on what leave class they fall into.
Number each Unique Leave Event. Using this example index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 3 would be Unique Leave Event 2, AND index 4 would be Unique Leave Event 1, etc.

I added in a column for the desired output, labeled as "Desired Output". Note, there can be many more rows/events per person; and there can be many more people.

Some Data

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
        'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
        'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])

Some Code I've Tried

df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]')
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
df['Prior Row in Same Emplid Class'] = "No"
df['Effdt Diff'] = df['Effdt-Shift'] - df['Effective Date']
df['Effdt Diff'] = (pd.to_timedelta(df['Effdt Diff'], unit='d') + pd.to_timedelta(1,unit='s')).astype('timedelta64[D]')
df['Cumul. Count'] = df.groupby('Employee ID').cumcount()


df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')
df['First Row Appears?'] = ""
df['First Row Appears?'][df['Cumul. Count'] == df['Groupby']] = "First Row"
df['Prior Row in Same Emplid Class'][ df['Employee ID'] == df['EmplidShift']]  = "Yes"

df['Prior Row in Same Emplid Class'][ df['Employee ID'] == df['EmplidShift']]  = "Yes"

df['Effdt > 1 Yr?'] = ""                                        
df['Effdt > 1 Yr?'][ ((df['Prior Row in Same Emplid Class'] == "Yes" ) & (df['Effdt Diff'] < -365))  ] = "Yes"

df['Unique Leave Event'] = ""
df['Unique Leave Event'][ (df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row") ] = "Unique Leave Event" 

df

score 4 · Accepted Answer · edited May 23 '17 at 11:48

You can do this without having to loop or iterate through your dataframe. Per Wes McKinney you can use .apply() with a groupBy object and define a function to apply to the groupby object. If you use this with .shift() (like here) you can get your result without using any loops.

Terse example:

# Group by Employee ID
grouped = df.groupby("Employee ID")
# Define function 
def get_unique_events(group):
    # Convert to date and sort by date, like @Khris did
    group["Effective Date"] = pd.to_datetime(group["Effective Date"])
    group = group.sort_values("Effective Date")
    event_series = (group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')).apply(lambda x: int(x)).cumsum()+1
    return event_series

event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0)
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True)
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x))
df['Match'] = df['Desired Output'] == df['Output']

print(df)

Output:

  Employee ID Effective Date        Desired Output  Unique Event  \
3         100     2013-01-01  Unique Leave Event 1             1
2         100     2014-07-01  Unique Leave Event 2             2
1         100     2015-06-05  Unique Leave Event 2             2
0         100     2016-01-01  Unique Leave Event 2             2
6         200     2013-01-01  Unique Leave Event 1             1
5         200     2015-01-01  Unique Leave Event 2             2
4         200     2016-01-01  Unique Leave Event 2             2
7         300        2014-01  Unique Leave Event 1             1

                 Output Match
3  Unique Leave Event 1  True
2  Unique Leave Event 2  True
1  Unique Leave Event 2  True
0  Unique Leave Event 2  True
6  Unique Leave Event 1  True
5  Unique Leave Event 2  True
4  Unique Leave Event 2  True
7  Unique Leave Event 1  True

More verbose example for clarity:

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
        'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
        'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])

# Group by Employee ID
grouped = df.groupby("Employee ID")

# Define a function to get the unique events
def get_unique_events(group):
     # Convert to date and sort by date, like @Khris did
    group["Effective Date"] = pd.to_datetime(group["Effective Date"])
    group = group.sort_values("Effective Date")
    # Define a series of booleans to determine whether the time between dates is over 365 days
    # Use .shift(1) to look back one row
    is_year = group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')
    # Convert booleans to integers (0 for False, 1 for True)
    is_year_int = is_year.apply(lambda x: int(x))    
    # Use the cumulative sum function in pandas to get the cumulative adjustment from the first date.
    # Add one to start the first event as 1 instead of 0
    event_series = is_year_int.cumsum() + 1
    return event_series

# Run function on df and put results into a new dataframe
# Convert Employee ID back from an index to a column with .reset_index(level=0)
event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0)

# Merge the dataframes
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True)

# Add string to match desired format
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x))

# Check to see if output matches desired output
df['Match'] = df['Desired Output'] == df['Output']

print(df)

You get the same output:

  Employee ID Effective Date        Desired Output  Unique Event  \
3         100     2013-01-01  Unique Leave Event 1             1
2         100     2014-07-01  Unique Leave Event 2             2
1         100     2015-06-05  Unique Leave Event 2             2
0         100     2016-01-01  Unique Leave Event 2             2
6         200     2013-01-01  Unique Leave Event 1             1
5         200     2015-01-01  Unique Leave Event 2             2
4         200     2016-01-01  Unique Leave Event 2             2
7         300        2014-01  Unique Leave Event 1             1

                 Output Match
3  Unique Leave Event 1  True
2  Unique Leave Event 2  True
1  Unique Leave Event 2  True
0  Unique Leave Event 2  True
6  Unique Leave Event 1  True
5  Unique Leave Event 2  True
4  Unique Leave Event 2  True
7  Unique Leave Event 1  True

That's an elegant solution. The only danger could lie in the `merge` if OP is using really huge dataframes, but judging from the content of the data that's unlikely. — Khris, Oct 10 '16 at 08:23

Khris · Answer 2 · 2016-10-06T06:15:25.990

This is a bit clunky but it yields the right output at least for your small example:

import pandas as pd

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
        'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01-01"],
        'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])

df["Effective Date"] = pd.to_datetime(df["Effective Date"])
df = df.sort_values(["Employee ID","Effective Date"]).reset_index(drop=True)

for i,_ in df.iterrows():
  df.ix[0,"Result"] = "Unique Leave Event 1"
  if i < len(df)-1:
    if df.ix[i+1,"Employee ID"] == df.ix[i,"Employee ID"]:
      if df.ix[i+1,"Effective Date"] - df.ix[i,"Effective Date"] > pd.Timedelta('365 days'):
        df.ix[i+1,"Result"] = "Unique Leave Event " + str(int(df.ix[i,"Result"].split()[-1])+1)
      else:
        df.ix[i+1,"Result"] = df.ix[i,"Result"]
    else:
      df.ix[i+1,"Result"] = "Unique Leave Event 1"

Note that this code assumes that the first row always contains the string Unique Leave Event 1.

EDIT: Some explanation.

First I convert the dates to datetime format and then reorder the dataframe such that the dates for every Employee ID are ascending.

Then I iterate over the rows of the frame using the built-int iterator iterrows. The _ in for i,_ is merely a placeholder for the second variable I do not use because the iterator gives back both row numbers and rows, I only need the numbers here.

In the iterator I'm doing row-wise comparisons so by default I fill in the first row by hand and then assign to the i+1-th row. I do it like this because I know the value of the first row but not the value of the last row. Then I compare the i+1-th row with the i-th row within an if-safeguard because i+1 would give an index-error on the last iteration.

In the loop I first check if the Employee ID has changed between the two rows. If it has not then I compare the dates of the two rows and see if they are apart more than 365 days. If this is the case I read the string "Unique Leave Event X" from the i-th row, increase the number by one and write it in the i+1-row. If the dates are closer I just copy the string from the previous row.

If the Employee ID does change on the other hand I just write "Unique Leave Event 1" to start over.

Note 1: iterrows() has no options to set so I can't iterate only over a subset.

Note 2: Always iterate using one of the built-in iterators and only iterate if you can't solve the problem otherwise.

Note 3: When assigning values in an iteration always use ix, loc, or iloc.

Thanks! Can you please provide some commentary on how you did this? — June, Sep 30 '16 at 21:19
Hi, and sorry for the long wait, I only comment here from work and we had a three-day weekend. I will add some comment now. — Khris, Oct 04 '16 at 06:00

Create groups/classes based on conditions within columns

2 Answers2