I have the below dataset:
The formula for calculating the hazard rate is:
For Year = 1: Hazard_rate(Year) = PD(Year)
For Year > 1: Hazard_rate(Year) = (PD(Year) + Hazard_rate(Year - 1) * (Year - 1)) / (Year)
Assumptions: within each customer_ID, the years are monotonically increasing and strictly greater than 0.
Because this formula is recursive and needs the previous year's hazard rate, my code below is slow and becomes unmanageable with large datasets. Is there a way I can vectorize this operation, or at least make the loop faster?
# Calculate the hazard rates.
# Collect the hazard rate for each row in a list, since the recursive formula
# needs the previous year's value.
hr = []
# Loop through the dataframe, applying the hazard rate formula.
for index, row in df.iterrows():
    if row["Year"] == 1:
        # For the first year the hazard rate is equal to the PD.
        hr.append(row["PD"])
    elif row["Year"] > 1:
        # The previous row holds the same customer's previous year, assuming the
        # rows are sorted by customer_ID and Year and the years are consecutive.
        # (Indexing hr by Year alone would pick up another customer's values.)
        hr.append((row["PD"] + hr[-1] * (row["Year"] - 1)) / row["Year"])
    else:
        raise ValueError("Year contains negative or zero values")
# Attach the hazard rates to the dataframe.
df["hazard_rate"] = hr
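One possible vectorization, sketched under the assumption that each customer's years run 1, 2, 3, ... with no gaps (which is stronger than the monotonicity stated above): multiplying the recursion through by Year gives Year * Hazard_rate(Year) = PD(Year) + (Year - 1) * Hazard_rate(Year - 1), so Year * Hazard_rate(Year) is simply the running sum of PD for that customer. The sample values below are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "customer_ID": [1, 1, 1, 2, 2],
    "Year": [1, 2, 3, 1, 2],
    "PD": [0.10, 0.20, 0.30, 0.05, 0.15],
})

# Sort so that each customer's rows are in year order.
df = df.sort_values(["customer_ID", "Year"]).reset_index(drop=True)

# Year * Hazard_rate(Year) equals the per-customer running sum of PD,
# ASSUMING the years are consecutive and start at 1 for every customer.
df["hazard_rate"] = df.groupby("customer_ID")["PD"].cumsum() / df["Year"]

print(df)
```

If the years can have gaps, this shortcut no longer matches the recursive formula, and a per-customer loop (or a compiled loop) would still be needed.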