I have the below dataset:

[Image: table of the dataset; columns include the customer ID, Year, PD and the expected hazard rate]

The formula for calculating the hazard rate is:

For Year = 1: Hazard_rate(Year) = PD(Year)

For Year > 1: Hazard_rate(Year) = (PD(Year) + Hazard_rate(Year - 1) * (Year - 1)) / (Year)

Assumptions: By customer_ID, the years are monotonic and strictly > 0
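
As a quick sanity check of the formula, here is a minimal sketch for a single customer, assuming the years run 1, 2, 3, ... with no gaps. The PD values and the helper name `hazard_rates` are made up for illustration and are not taken from the dataset. Multiplying the recurrence by Year gives Year * Hazard_rate(Year) = PD(Year) + (Year - 1) * Hazard_rate(Year - 1), so Year * Hazard_rate(Year) is the running sum of PD and the hazard rate is the running mean of PD. The sketch compares both ways of computing it:

```python
from itertools import accumulate

def hazard_rates(pds):
    """Apply the recurrence to one customer's PDs, ordered by year 1, 2, 3, ..."""
    rates = []
    for year, pd_value in enumerate(pds, start=1):
        if year == 1:
            rates.append(pd_value)
        else:
            rates.append((pd_value + rates[-1] * (year - 1)) / year)
    return rates

# Hypothetical PDs for years 1..3 (illustrative values only)
pds = [0.05, 0.07, 0.09]
recursive = hazard_rates(pds)

# Running mean of PD: cumulative sum divided by the year number
running_mean = [total / year for year, total in enumerate(accumulate(pds), start=1)]

print(recursive)     # approximately 0.05, 0.06, 0.07
print(running_mean)  # the same values, up to floating-point rounding
```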

Because this formula is recursive and needs the previous year's hazard rate, my code below is slow and becomes unmanageable with large datasets. Is there a way to vectorize this operation, or at least make the loop faster?

#Calculate the hazard rates
#Collect each result in a list, because the recursive formula needs the previous year's value
hr = []

#Loop through the dataframe, executing the hazard rate formula
#Assumes the rows are sorted by customer ID and then by year, with no gaps
for index, row in df.iterrows():
    if row["Year"] == 1:
        #If the year is 1 then the hazard rate is equal to the PD
        hr.append(row["PD"])
    elif row["Year"] > 1:
        #hr[-1] is the previous row, i.e. the same customer's previous year
        hr.append((row["PD"] + hr[-1] * (row["Year"] - 1)) / row["Year"])
    else:
        raise ValueError("Year contains negative or zero values")

#Attach the hazard rates to the dataframe
df["hazard_rate"] = hr
Avi
78282219
  • 1
    Just to clarify: the dataset you show at the beginning is what you want to calculate, and your dataframe only has the `year` and `PD` columns to start with? – FBruzzesi Nov 24 '19 at 10:18
  • 1
    Would it help to do `df.loc[index, 'hazard_rate'] = *formula results*` instead of working with the list? – Aryerez Nov 24 '19 at 10:24
  • 1
    FBruzzesi, correct - I added the hazard rate column for people to verify their results – 78282219 Nov 24 '19 at 10:32
  • Aryerez, I tried to use .loc in the past. However, as the formula requires the previous result, I couldn't get it to work. Would you be able to show me? – 78282219 Nov 24 '19 at 10:33
  • Is the data sorted by year? Are there any gaps between years? Can it really happen that a year is <= 0? – Óscar López Nov 24 '19 at 10:40
  • 1
    The data will be sorted by year, with strictly no gaps between years and strictly no zero or negative years, as these are forecast years – 78282219 Nov 24 '19 at 10:42

1 Answer

This function computes the n-th hazard rate:

computed = {1: 0.05}
def func(n, computed = computed):
    '''
    Parameters:
        @n: int, year number
        @computed: dictionary with hazard rate already computed
    Returns:
        computed[n]: n-th hazard rate
    '''

    if n not in computed:
        computed[n] = (df.loc[n,'PD'] + func(n-1, computed)*(n-1))/n

    return computed[n]

Now let's compute the Hazard rate for each year:

df.set_index('year', inplace=True)
df['Hazard_rate'] = [func(i) for i in df.index]

Note that the function does not care whether the dataframe is sorted by year; however, I am assuming that the dataframe is indexed by year.

If you want the column back, just reset the index:

df.reset_index(inplace=True)
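
To see that the dictionary prevents recomputation, here is a self-contained sketch of the same idea. The four-year frame, its PD values and the `calls` counter are illustrative assumptions, not part of the original data; the base case is read from the frame instead of being hard-coded.

```python
import pandas as pd

# Hypothetical single-customer frame indexed by year (values are made up)
df = pd.DataFrame({"year": [1, 2, 3, 4], "PD": [0.05, 0.07, 0.09, 0.11]}).set_index("year")

calls = {"formula": 0}  # counts how often the formula is actually evaluated

def func(n, computed):
    """Return the n-th hazard rate, reusing values already stored in `computed`."""
    if n not in computed:
        calls["formula"] += 1
        computed[n] = (df.loc[n, "PD"] + func(n - 1, computed) * (n - 1)) / n
    return computed[n]

computed = {1: df.loc[1, "PD"]}  # base case: the year-1 hazard rate equals PD
rates = [func(i, computed) for i in df.index]

print(rates)             # approximately 0.05, 0.06, 0.07, 0.08
print(calls["formula"])  # 3: one evaluation each for years 2, 3 and 4
```

Each year is evaluated once, so the total work is linear in the number of years. Note that an unsorted index with a large maximum year could still hit Python's recursion limit on the first call.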

With the introduction of Customer_ID, the process becomes more complex:

#Function depends upon dataframe passed as argument
def func(df, n, computed):

    if n not in computed:
        #Pass the dataframe along in the recursive call as well
        computed[n] = (df.loc[n,'PD'] + func(df, n-1, computed)*(n-1))/n

    return computed[n]

#Set index
df.set_index('year', inplace=True)

#Initialize Hazard_rate column as float
df['Hazard_rate'] = 0.0

#Iterate over each unique customer
for c in df['Customer_ID'].unique():

    #Create a customer mask
    c_mask = (df['Customer_ID'] == c)

    #Rows of this customer only, indexed by year
    c_df = df.loc[c_mask]

    # Initialize computed dictionary for given customer
    c_computed = {1: c_df.loc[1,'PD']}

    #Assign through a single .loc call so the values are written back to df
    df.loc[c_mask, 'Hazard_rate'] = [func(c_df, i, c_computed) for i in c_df.index]
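
As an alternative to looping over customers, the recurrence can be unrolled: multiplying it by the year shows that year * Hazard_rate(year) is the running sum of PD, so the hazard rate is the cumulative sum of PD divided by the year. A minimal vectorized sketch under the stated assumptions (years start at 1 and have no gaps within each customer) might look like this. The frame below is hypothetical, and the column names `Customer_ID`, `Year` and `PD` follow the question's code and may need adapting.

```python
import pandas as pd

# Hypothetical data: two customers, years sorted and gap-free within each customer
df = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B"],
    "Year": [1, 2, 3, 1, 2],
    "PD": [0.05, 0.07, 0.09, 0.10, 0.20],
})

# Running sum of PD within each customer, divided by the year number
df["hazard_rate"] = df.groupby("Customer_ID")["PD"].cumsum() / df["Year"]

print(df)
```

This avoids both the Python-level loop and the recursion. It relies on the rows being ordered by year within each customer; if that is not guaranteed, sorting by `Customer_ID` and `Year` first would be needed.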

FBruzzesi
  • Since you have now introduced a new variable `Customer_ID`, the above code will not work as intended – FBruzzesi Nov 24 '19 at 11:23
  • Without checking it myself, it looks like your function would be much worse than the OP's, as you re-calculate from scratch the entire path for each *year*, while he uses the already-calculated result from the previous year. – Aryerez Nov 24 '19 at 11:25
  • I can run on one ID and iterate over each ID – 78282219 Nov 24 '19 at 11:26
  • @Aryerez once a year is calculated, it is not calculated again. This is a typical way of doing recursion (e.g. this is the fastest way of calculating Fibonacci numbers in Python, see [link](https://stackoverflow.com/questions/18172257/efficient-calculation-of-fibonacci-series)) – FBruzzesi Nov 24 '19 at 11:28
  • @78282219 just take care of re-initializing the function in each loop, since the function also initializes the dictionary storing what is already computed – FBruzzesi Nov 24 '19 at 11:30
  • I am working on feeding through the first line into the function argument – 78282219 Nov 24 '19 at 11:30
  • @FBruzzesi I don't know what you meant in the link with the 10's answers, but think for yourself about what calculations your recursive function needs to do when `i` is, for example, 10, and compare that to what the OP's function does when `year` is 10: while he has a one-line formula that uses the previously calculated value for year 9, you start a function with 9 recursive calls for that year alone. – Aryerez Nov 24 '19 at 11:35
  • @Aryerez Such a recursive function does the same amount of calculation you are referring to. However, it does not need the dataframe to be sorted in any way. The only required object is a base case for each `Customer_ID`. I guess you are thinking that the function re-computes all values each time, but it doesn't: what has been computed is saved in the `computed` dictionary, and values are looked up there before being recomputed. This is called _memoization_ – FBruzzesi Nov 24 '19 at 20:01