How can I get pandas to adjust my formula based on a specific value in a dataframe?

Question

I have a pandas dataframe that looks like this:

Emp_ID | Weekly_Hours | Hire_Date | Termination_Date | Salary_Paid    | Multiplier | Hourly_Pay

A1     | 35           | 01/01/1990 | 06/04/2020      | 5000           | 0.229961   | 32.85
B2     | 35           | 02/01/2020 | NaN             | 10000          | 0.229961   | 65.70
C3     | 30           | 23/03/2020 | NaN             | 5800           | 0.229961   | 44.46

The multiplier is a static figure for all employees, calculated as 7 / 30.44. The hourly pay is worked out by multiplying the monthly salary by the multiplier and dividing by the weekly contracted hours.

Now my challenge is to get Pandas to recognise a date in the Termination Date field, and adjust the calculation. For instance, the first record would need to be updated to show that the employee was actually paid 5k through the payroll for 4 business days, not the full month, given that they resigned on 06/04/2020. So the expected hourly pay figure would be (5000 / 4 * 7 / 35) = 250.

I can code the calculation quite easily; my struggle is adding a column to reflect the business days (4 in the above example) in a fresh column for all April leavers (not interested in any other months). So far I have tried.

df['T_Mth_Workdays'] = np.where(df['Termination_Date'].notnull(), np.busday_count('2020-04-01', df['Termination_Date']), 0)

However the above approach returns an error stating that:

iterator operand 0 dtype could not be cast from dtype(' m8 [ns] ') to dtype(' m8 [d] ')

I should add here that I had to change the dates to datetime[ns64] format manually.

Any pointers gratefully received. Thanks!

It seems like the error is US's MM-DD-YY vs EU's DD-MM-YY. You are trying to do calculations using one of each type — noah, Oct 16 '20 at 17:08
You could consider trying `np.busday_count('2020-04', df['Termination_Date'])`. This way it can't be confused. Not sure if that will work or not though — noah, Oct 16 '20 at 17:10
thanks@noah. Unfortunately I need the workdays between two specific dates, so that won't work. Also I had a typo in my question which I've now fixed - see above. thanks! — SQLGIT_GeekInTraining, Oct 16 '20 at 17:14
I found this other [stackoverflow discussion](https://stackoverflow.com/questions/31917964/python-numpy-cannot-convert-datetime64ns-to-datetime64d-to-use-with-numba), which may be of use for converting the `Termination_Date` column from `datetime64[ns]` to `datetime64[d]`, which is what `np.busday_count` requires. Essentially, you'd need to pass `df["Termination_Date"].values.astype('datetime64[D]')` as the second argument to `np.busday_count`, but this doesn't seem to work when there are `NaT`s present in the column (`NaT` = not a time, essentially a NaN). — Ken, Oct 16 '20 at 17:44

Ken · Accepted Answer · 2020-10-16T22:14:26.647

1

The issue with your np.where function call is that it is trying to pass the entire series df["Termination_Date"] as an argument to np.busday_count. The count function fails because it requires arguments to be in the np.datetime64[D] format (i.e., value only specified to the day), and the Series cannot be easily converted to this format.

One solution is to write a custom function that only calls that np.busday_count on elements that are not NaTs, converting those to the datetime64[D] type before calling np.busday_count. Then, you can apply the custom function to the df["Termination_Date"] series, as below:

#!/usr/bin/env python3

import numpy as np
import pandas as pd

DATE_FORMAT = "%d-%m-%Y"

# Reproduce raw data
raw_data = [
    ["A1", 35, "01/01/1990", "06/04/2020", 5000, 0.229961, 32.85],
    ["B2", 35, "02/01/2020", None, 10000, 0.229961, 65.70],
    ["C3", 35, "23/03/2020", "NAT", 5800, 0.229961, 44.46],
]

# Convert raw dates to ISO format, then np.datetime64
def parse_raw_dates(s):
    try:
        spl = s.split("/")
        ds = "%s-%s-%s" %(spl[2], spl[1], spl[0]) 
    except:
        ds = "NAT"
    return np.datetime64(ds)
for line in raw_data:
    line[2] = parse_raw_dates(line[2])

# Create dataframe
df = pd.DataFrame(
    data = raw_data,
    columns = [
        "Emp_ID", "Weekly_Hours", "Hire_Date", "Termination_Date",
        "Salary_Paid", "Multiplier", "Hourly_Pay"],
)

# Create special conversion function
def myfunc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 0
    else:
        return np.busday_count('2020-04-01', d)
df['T_Mth_Workdays'] = df["Termination_Date"].apply(myfunc)

def format_date(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return ""
    else:
        return pd.to_datetime(d).strftime(DATE_FORMAT)
df["Hire_Date"] = df["Hire_Date"].apply(format_date)
df["Termination_Date"] = df["Termination_Date"].apply(format_date)

edited Oct 16 '20 at 22:14

answered Oct 16 '20 at 18:12

Ken

443
4
8

This is fantastic.. very nearly there. The only bizarre thing is, it converts the termination date to the American style format for some reason which messes up my calculations. How can prevent this from happening? – SQLGIT_GeekInTraining Oct 16 '20 at 20:55
@SQLGIT_GeekInTraining How are you printing and/or outputting the values? – Ken Oct 16 '20 at 21:06
I used the function as you've mentioned above, and eventually exported my results to a csv. – SQLGIT_GeekInTraining Oct 16 '20 at 21:07
It's annoying how we as a world can't agree to write a date in the same way! :D – SQLGIT_GeekInTraining Oct 16 '20 at 21:09
@SQLGIT_GeekInTraining Is the series/column in `np.datetime64` format or pandas `Timestamp` format? If it is the latter, then use `df['Termination_Date'] = df['Termination_Date'].apply(lambda x: x.strftime('%d-%m-%y'))`, before writing the CSV. If it is the former, you can do the same thing, but you'll need to convert it to a pandas `Timestamp` type first using `df['Termination_Date'] = pdf.to_datetime(df['Termination_Date'])`. – Ken Oct 16 '20 at 21:12
@SQLGIT_GeekInTraining when in doubt, you can use ISO-8601 formatting for dates and times. IMHO, day should precede the month anyway :) – Ken Oct 16 '20 at 21:13
Ah actually I need this fixed while writing the function, as I'd like to use the data to recalculate the multiplier. In the example, employee A1's leaving date is incorrectly processed as 04/06/2020, leading to working days come out as 46 when the correct number should be 4. – SQLGIT_GeekInTraining Oct 16 '20 at 21:16
1

Correct, you'll want to do something like the following for Hire Date: `df["Hire_Date"] = df["Hire_Date"].apply(lambda x: pd.to_datetime(x).strftime("%d-%m-%Y"))`, but then you'll need to use another custom function to do the same for `Termination_Date` as it contains `NaT`s. I'll update my solution shortly. – Ken Oct 16 '20 at 21:20
Ok, the error may be coming in when you convert from a String object to `datetime64`. You'll notice in my example that I put the dates in ISO format (YYYY-MM-DD), but I'm not familiar with your raw input (or rather, what happens when you perform the conversion to `datetime64`). – Ken Oct 16 '20 at 21:28
Am just trying out your function. My raw input was a string in the d/m/y format. – SQLGIT_GeekInTraining Oct 16 '20 at 21:29
Just tried out your function, it effectively converted my date back to a string :( I need to use: `df['Workdays'] = df['Termination_Date'].apply(myfunc)` So the above operation now throws an error as the use of `strftime` changed the dates back to a string format. – SQLGIT_GeekInTraining Oct 16 '20 at 21:34
@SQLGIT_GeekInTraining Those `apply` functions to convert to string need to be applied directly before writing the CSV, so that none of the calculations that make use of date are affected. – Ken Oct 16 '20 at 21:36
@Ken- Is there no way of ensuring that the `df['Workdays'] = df['Termination_Date'].apply(myfunc)` can be calculated correctly using the right format? – SQLGIT_GeekInTraining Oct 16 '20 at 21:39
@SQLGIT_GeekInTraining Please see my latest edit to the solution, which takes into account the raw input's non-ISO date format. – Ken Oct 16 '20 at 21:43
Great - thanks. Get an error: list index out of range - for the `line[2] = parse_raw_dates(line[2])` line – SQLGIT_GeekInTraining Oct 16 '20 at 22:07
That error implies that your raw data is different from mine. The `parse_raw_dates` function requires each element of `raw_data` to be a list containing at least 3 members, and where the element at index `2` is a `%d/%m/%Y` string. For good measure, my `parse_raw_dates` function should also handle the 4th column (index `3`), as well as accommodate null dates. I'll update the answer to do so. – Ken Oct 16 '20 at 22:10
Ah ok - my raw data is a dataframe (df), not a list. – SQLGIT_GeekInTraining Oct 16 '20 at 22:12
You mention in your question that you manually convert the dates: "I should add here that I had to change the dates to datetime[ns64] format manually." You will need to make sure that the conversion to datetime64 is being performed correctly, as datetime64 expects ISO-8601 formatted dates. – Ken Oct 16 '20 at 22:25
Yes - that's because when I imported the dates these were dtype object. I needed to convert them to datetime. Nothing seems to work.. The original solution takes me exactly where I need to be except for the (highly frustrating) issue with the date format change to US style.. That's literally the only thing I'm grappling with. – SQLGIT_GeekInTraining Oct 16 '20 at 22:30
If you post a minimal code example of the conversion you are performing, I can add it to my solution. – Ken Oct 16 '20 at 22:39
1

Finally cracked it!!! Needed a simple adjustment to your solution. Will post the solution tomorrow AM UK time - thanks so much for all your help :) – SQLGIT_GeekInTraining Oct 16 '20 at 23:24
thanks so much again - please review my own answer where I've explained the changes I made using your original code as a starting point. Happy to get your thoughts. – SQLGIT_GeekInTraining Oct 17 '20 at 08:05

SQLGIT_GeekInTraining · Answer 2 · 2020-10-17T10:55:21.630

Posting my approach here in case it helps others in the future. Firstly code for creating the dataframe:

d = {'Emp_ID': ['A1', 'B2', 'C3'], 'Weekly Hours': ['35', '35', '30'], 'Hire_Date': ['01/01/1990', '02/01/2020', '23/03/2020'], 
     'Termination_Date': ['06/04/2020', np.nan, np.nan], 'Salary_Paid': [5000, 10000, 5800]}
df = pd.DataFrame(data=d)

df

The first step was to convert the dates to a more useable format - this is where pd.to_datetime() comes in handy -the adjustment needed was to specify the format.

df['Hire_Date'] = pd.to_datetime(df['Hire_Date'], format='%d/%m/%Y')
df['Termination_Date'] = pd.to_datetime(df['Termination_Date'], format='%d/%m/%Y')

This has the desired effect; whereby the dates are correctly represented and April is picked up as the right month of termination for employee A1.

I now (slightly) adjusted Ken's custom solution for calculating the working days in April:

def workday_calc(d):
    d = d.to_numpy().astype('datetime64[D]')
    if np.isnat(d):
        return 30.44
    else:
        d = d.astype(str)
        d = dt.datetime.strptime(d, '%Y-%m-%d')
        e = (d + dt.timedelta(1)).strftime('%Y-%m-%d')
        return np.busday_count('2020-04-01', e, weekmask=[1,1,1,1,1,0,0])

I spotted the error while reviewing numpy documentation on np.busday_count(). There are two useful pointers to note: The use of the datetime64[D] is mandatory in the first line of the function - you can't use pd.to_datetime(). This is because the datetime64[D] format is a pre-requisite to being able to call the np.isnat() function.

However, the minute we deal with the NaT in the dataframe, we need to switch back to a string format, which is needed for the datetime.strptime() function. Using the datetime.strptime() feature, we tell Python that the date is a) represented in the ISO format, and we need to retain it as a string. The advantage with both datetime.strptime() and np.busday_count() is that they are both built to handle strings.

Also, the np.busday_count() excludes the end date, so I used timedelta() to increment the end date by one, so that all the dates in the interim are counted. This may or may not be appropriate given what you're trying to do, but I wanted an inclusive count of days worked in April. So in this case, the employee has worked for 4 business days in April.

We then simply apply the custom function and create a new column.

df['Days_Worked_April'] = df['Termination_Date'].apply(workday_calc)

I was now able to use the freshly created column to derive my multiplier - using the same old approach. The rest is simple, but I'm including the code and results below for completeness.

df['Multiplier'] = df.apply(lambda x: 7 / x['Days_Worked_April'], axis=1)
df['Hourly_Pay_Calc'] = round((df.apply(lambda x: x['Salary_Paid'] * x['Multiplier'] / x['Weekly Hours'], axis=1)), 2)

Output:

Emp_ID  Weekly Hours    Hire_Date   Termination_Date    Salary_Paid Days_Worked_April   Multiplier  Hourly_Pay_Calc
0   A1  35.0    1990-01-01  2020-04-06  5000    4.00    1.750000    250.00
1   B2  35.0    2020-01-02  NaT 10000   30.44   0.229961    65.70
2   C3  30.0    2020-03-23  NaT 5800    30.44   0.229961    44.46

How can I get pandas to adjust my formula based on a specific value in a dataframe?

2 Answers2