Calculate difference between dates for sequential pandas rows based on conditional column value

Question

I need to find the number of days between a request date and its most recent offer date for each apartment number. My example dataframe looks like the first 3 columns below and I'm trying to figure out how to calculate the 'days_since_offer' column. The apartment and or_date columns are already sorted.

apartment   offer_req_type  or_date days_since_offer
A   request 12/4/2019   n/a
A   request 12/30/2019  n/a
A   offer   3/4/2020    0
A   request 4/2/2020    29
A   request 6/4/2020    92
A   request 8/4/2020    153
A   offer   12/4/2020   0
A   request 1/1/2021    28
B   offer   1/1/2019    0
B   request 8/1/2019    212
B   offer   10/1/2019   0
B   request 1/1/2020    92
B   request 9/1/2020    244
B   offer   1/1/2021    0
B   request 1/25/2021   24

I tried to create a new function which sort of gives me what I want if I pass it dates for a single apartment. When I use the apply function is gives me an error though: "SpecificationError: Function names must be unique if there is no new column names assigned".

def func(attr, date_ser):
offer_dt = date(1900,1,1)
lapse_days = []
for row in range(len(attr)):
    if attr[row] == 'offer':
        offer_dt = date_ser[row]
        lapse_days.append(-1)
    else:
        lapse_days.append(date_ser[row]-offer_dt)
print(lapse_days)
return lapse_days

df['days_since_offer'] = df.apply(func(df['offer_req_type'], df['or_date']))

I also tried to use groupby + diff functions like this and this but it's not the answer that I need:

df.groupby('offer_req_type').or_date.diff().dt.days

I also looked into using the shift method, but I'm not necessarily looking at sequential rows every time.

Any pointers on why my function is failing or if there is a better way to get the date differences that I need using a groupby method would be helpful!

JarroVGIT · Accepted Answer · 2021-02-11T09:23:21.063

I have played around and I am certainly not claiming this to be the best way. I used df.apply() (edit: see below for alternative without df.apply()).

import numpy as np
import pandas as pd

# SNIP: removed the df creation part for brevity.

df["or_date"] = pd.to_datetime(df["or_date"])
df.drop("days_since_offer", inplace=True, axis="columns")

def get_last_offer(row:pd.DataFrame, df: pd.DataFrame):
    if row["offer_req_type"] == "offer":
        return
    temp_df = df[(df.apartment == row['apartment']) & (df.offer_req_type == "offer") & (df.or_date < row["or_date"])]
    if temp_df.empty:
        return
    else:
        x = row["or_date"]
        y = temp_df.iloc[-1:, -1:]["or_date"].values[0]
        return  x-y    
    return 1

df["days_since_offer"] = df.apply(lambda row: get_last_offer(row, df), axis=1)
print(df)

This returns the following df:

0          A        request 2019-12-04              NaT
1          A        request 2019-12-30              NaT
2          A          offer 2020-03-04              NaT
3          A        request 2020-04-02          29 days
4          A        request 2020-06-04          92 days
5          A        request 2020-08-04         153 days
6          A          offer 2020-12-04              NaT
7          A        request 2021-01-01          28 days
8          B          offer 2019-01-01              NaT
9          B        request 2019-08-01         212 days
10         B          offer 2019-10-01              NaT
11         B        request 2020-01-01          92 days
12         B        request 2020-09-01         336 days
13         B          offer 2021-01-01              NaT
14         B        request 2021-01-25          24 days

EDIT I was wondering if I couldn't find a way not using df.apply(). I ended up with the following lines: (replace from line def get_last_offer() in previous code bit)

df["offer_dates"] =  np.where(df['offer_req_type'] == 'offer', df['or_date'], pd.NaT)
# OLD: df["offer_dates"].ffill(inplace=True)
df["offer_dates"] = df.groupby("apartment")["offer_dates"].ffill()
df["diff"] = pd.to_datetime(df["or_date"]) - pd.to_datetime(df["offer_dates"])
df.drop("offer_dates", inplace=True, axis="columns")

This creates a helper column (df['offer_dates']) which is filled for every row that has offer_req_type as 'offer'. Then it is forward-filled, meaning that every NaT value will be replaced with the previous valid value. Then We calculate the df['diff'] column, with the exact same result. I like this bit better because it is cleaner and it has 4 lines rather than 12 lines of code :)

Thanks! Edited version actually creates minor inconsistencies with actual dataset. If an apartment only has offer date, with no reactive dates and it's subsequent apartment row has reactive dates that are prior to the first offer date, then it creates a negative number or false entry. For example, add the row ['C', 'offer', date(2020, 4, 6)] to the beginning of the dataset. The first method you posted worked perfectly because it compares apartment numbers instead of just comparing sequential rows. — KLANGE, Feb 10 '21 at 18:29
Yes you are right, I hadn't considered such case. To counter that, you can use the `groupby()`. I've updated the second bit and tested it with the added egde case, and it works like a charm now :) — JarroVGIT, Feb 11 '21 at 09:26

Calculate difference between dates for sequential pandas rows based on conditional column value

1 Answers1

Linked