How to split a dataframe with multiple curve data points

Question

I have a corporate bond dataframe that has multiple types of bonds with two columns on their yields and years-to-maturity values. When I plot their yields against the years to maturity, I can clearly see at least three, possibly four yield curves. I would like to fit at least three curves on this data and then divide the dataframe into smaller chunks based on distance to the curve. Here's a simple scatter plot:

I have tried the RANSAC method described here: Iteratively fitting polynomial curve

This is what I get:

and here's what I tried using RANSAC:

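import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import RANSACRegressor
from sklearn.metrics import mean_squared_error

y_ax = df_clean.YTW
x_ax = df_clean.YTM
# Plain arrays used for prediction and plotting below
# (assumed to be the same data as x_ax and y_ax)
x_vals = x_ax.to_numpy()
y_vals = y_ax.to_numpy()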

class PolynomialRegression(object):
    def __init__(self, degree=3, coeffs=None):
        self.degree = degree
        self.coeffs = coeffs

    def fit(self, X, y):
        self.coeffs = np.polyfit(X.ravel(), y, self.degree)

    def get_params(self, deep=False):
        return {'coeffs': self.coeffs}

    def set_params(self, coeffs=None, random_state=None):
        self.coeffs = coeffs

    def predict(self, X):
        poly_eqn = np.poly1d(self.coeffs)
        y_hat = poly_eqn(X.ravel())
        return y_hat

    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))

poly_degree = 3
ransac = RANSACRegressor(PolynomialRegression(degree=poly_degree),
                         residual_threshold=2 * np.std(y_ax),
                         random_state=0)
ransac.fit(np.expand_dims(x_ax, axis=1), y_ax)
inlier_mask = ransac.inlier_mask_

y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))
plt.plot(x_vals, y_vals, 'bx', label='input samples')
plt.plot(x_vals[inlier_mask], y_vals[inlier_mask], 'go', label='inliers (2*STD)')
plt.plot(x_vals, y_hat, 'r-', label='estimated curve')

In short:

Is there a way to fit 3-4 separate curves to these data points?
How can I split the dataframe based on these curves?

The full dataset is here (only YTW and YTM are plotted): Corp Bonds Data

score 1 · Accepted Answer · answered Jul 05 '22 at 11:28

I did some exploring of your data and this is what I came up with.

First, I noticed you had a lot of different IDs and issuers. I used pandas' groupby function to separate your dataframe into groups based on these two columns. I didn't get anything very interesting with ID, but I did with issuer.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit

df = pd.read_excel('./CorpBonds_Clean.xlsx')
groups = df.groupby('issuer')
fig, ax = plt.subplots()
for _, group_df in groups:
    subdf = group_df.sort_values('YTM')
    ax.plot(subdf['YTM'], subdf['YTW'], marker='o')

This is what I got:

Notice how each curve appears to have its own color, which suggests that the separation by issuer is related to the curves you described. If you move the fig, ax = plt.subplots() line inside the loop, you'll see each group separately. I did that and saw many groups with very few points, so I settled on a simple heuristic: keep only the groups with more than 5 points.
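
The length heuristic can also be applied up front, which leaves one sub-dataframe per issuer. The following is a minimal sketch on made-up toy data; the issuer names, the values, and the MIN_POINTS and chunks names are assumptions for illustration, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for the bond data; issuer names and values are made up.
toy = pd.DataFrame({
    "issuer": ["A"] * 7 + ["B"] * 3 + ["C"] * 6,
    "YTM": list(range(1, 8)) + [1, 2, 3] + list(range(1, 7)),
    "YTW": [3.0 + 0.1 * i for i in range(16)],
})

MIN_POINTS = 5  # same threshold as the heuristic above

# Keep only rows whose issuer group has more than MIN_POINTS rows.
large = toy.groupby("issuer").filter(lambda g: len(g) > MIN_POINTS)

# One sub-dataframe per remaining issuer, sorted by maturity.
chunks = {name: g.sort_values("YTM") for name, g in large.groupby("issuer")}

print(sorted(chunks))
```

Each value in chunks is an ordinary dataframe, so it can be plotted or fitted on its own.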

Now it's time to fit. I used this exponential plateau function, which I got from here:

def exp_plateau(x, ym, y0, k):
    return ym - (ym - y0) * np.exp(-k * x)
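
To see what the three parameters mean, here is a small check on synthetic data (not the bond data): the curve starts at y0 when x is 0 and levels off at ym for large x, while k sets how fast it gets there. The true parameters and the initial guess p0 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_plateau(x, ym, y0, k):
    return ym - (ym - y0) * np.exp(-k * x)

# Synthetic points generated from known parameters (not real bond data).
true_params = (6.0, 3.5, 0.2)  # ym, y0, k
x = np.linspace(0.5, 30, 40)
y = exp_plateau(x, *true_params)

# At x = 0 the curve starts at y0; for large x it approaches ym.
start = exp_plateau(0.0, *true_params)
far = exp_plateau(1e3, *true_params)

# Rough initial guess: plateau near the largest y, start near the smallest y.
p0 = (y.max(), y.min(), 0.1)
popt, pcov = curve_fit(exp_plateau, x, y, p0=p0)
```

Passing an initial guess through p0 is optional, but it may help with groups where the default starting point does not converge.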

And I fitted all the subgroups. Here's the result:

Here's the code that generated it:

groups = df.groupby('issuer')
fig, ax = plt.subplots()
param_names = ['ym', 'y0', 'k']
for name, group_df in groups:
    subdf = group_df.sort_values('YTM')
    # Skip issuers with too few points for a meaningful fit
    if len(subdf) > 5:
        try:
            popt, pcov = curve_fit(exp_plateau, subdf['YTM'], subdf['YTW'])
        except RuntimeError:
            # curve_fit raises RuntimeError when the optimizer fails to converge
            print(f"{name} couldn't be fit. Skipping")
            continue

        # Plot the data, then the fitted curve dashed in the same color
        line, = ax.plot(subdf['YTM'], subdf['YTW'])
        ax.plot(subdf['YTM'], exp_plateau(subdf['YTM'], *popt), color=line.get_color(), ls='--')
        # Standard errors are the square roots of the covariance diagonal
        perr = np.sqrt(np.diag(pcov))
        print(f'"{name}"', *[f'{p}: {v:.2f}+/-{e:.2f}' for p, v, e in zip(param_names, popt, perr)])
        #ax.set_title(name)
ax.set_xlabel('YTM')
ax.set_ylabel('YTW')

And the output with the fit parameters and their estimated errors:

"407 INTER INC CPN STRIP" ym: 5.66+/-0.09 y0: 3.80+/-0.05 k: 0.29+/-0.03
"AGT LTD COUPON STRIP" ym: 6.42+/-1.66 y0: 3.09+/-0.13 k: 0.23+/-0.17
"BANK OF NOVA SCOTIA" ym: 5.47+/-0.24 y0: 3.36+/-0.08 k: 0.43+/-0.10
"BCE COUPON STRIP" ym: 8.04+/-0.47 y0: 3.95+/-0.03 k: 0.05+/-0.01
"BRCOL GENERIC STRIP" ym: 4.30+/-0.01 y0: -0.12+/-0.68 k: 0.29+/-0.02
"CANADIAN IMP BK COMM HK" ym: 3.37+/-0.17 y0: 2.27+/-0.53 k: 5.88+/-5.41
"CANADIAN TIRE CPN STRIP" ym: 7.30+/-0.42 y0: 3.81+/-0.04 k: 0.08+/-0.02
"GREAT-WEST LIFECO CPN ST" ym: 55.02+/-918.22 y0: 3.79+/-0.18 k: 0.00+/-0.06
"GREATER TORONTO CPN STRP" ym: 6.31+/-0.44 y0: 3.56+/-0.04 k: 0.08+/-0.02
"HYDRO ONE STRIP" ym: 5.48+/-0.16 y0: 3.10+/-0.08 k: 0.19+/-0.03
"LEVIS QUE COUPON STRIP" ym: 3.88+/-0.08 y0: 2.81+/-0.03 k: 0.32+/-0.05
LOBLAW COS CPN STRIP couldn't be fit. Skipping
"NEW BRUN GENERIC CPN STP" ym: 4.32+/-0.01 y0: 2.59+/-0.07 k: 0.27+/-0.02
"SAGUENAY CPN STRIP" ym: 3.89+/-0.06 y0: 2.78+/-0.04 k: 0.33+/-0.05
"SUN LIFE FIN SPN STRIP" ym: 6.47+/-0.20 y0: 3.98+/-0.06 k: 0.11+/-0.02
"TELUS CORP COUPON STRIP" ym: 5.99+/-0.04 y0: 3.29+/-0.07 k: 0.22+/-0.01
TORONTO DOMINION STRIP couldn't be fit. Skipping
"TRANS-CANADA CPN STRIP" ym: 6.53+/-0.39 y0: 3.85+/-0.06 k: 0.13+/-0.03
"TRANSALTA CORP CPN STRIP" ym: 8.91+/-1.91 y0: 4.95+/-0.07 k: 0.07+/-0.05
"WINNIPEG COUPON STRIP" ym: 4.70+/-0.03 y0: -1142.59+/-259776205.48 k: 1.06+/-33065.14

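Some of these fits are poorly constrained: the estimated errors for GREAT-WEST LIFECO and WINNIPEG are far larger than the parameters themselves. One possible way to flag such fits is to compare each error with its parameter. This is only a sketch; the 0.5 cutoff and the is_reliable helper are assumptions, not part of the original answer.

```python
import numpy as np

# (parameters, errors) copied from the output above for three issuers.
fits = {
    "TELUS CORP COUPON STRIP": ([5.99, 3.29, 0.22], [0.04, 0.07, 0.01]),
    "GREAT-WEST LIFECO CPN ST": ([55.02, 3.79, 0.00], [918.22, 0.18, 0.06]),
    "WINNIPEG COUPON STRIP": ([4.70, -1142.59, 1.06], [0.03, 259776205.48, 33065.14]),
}

MAX_REL_ERR = 0.5  # assumed cutoff, not from the original answer

def is_reliable(popt, perr, max_rel_err=MAX_REL_ERR):
    """Return True if every error is below max_rel_err times |parameter|."""
    popt = np.asarray(popt, dtype=float)
    perr = np.asarray(perr, dtype=float)
    return bool(np.all(perr < max_rel_err * np.abs(popt)))

reliable = {name: is_reliable(p, e) for name, (p, e) in fits.items()}
```

With this cutoff only the TELUS fit passes. Issuers that fail could be refitted with a better initial guess or left out of the split.
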
This is a very cool way of fitting a yield curve and definitely worth selecting as an answer. Normally we would group issuers by their credit rating, but that doesn't always place each yield on the correct yield curve. What you did (maybe unintentionally) was group them by market perception (issuer name :) ). That has in fact been common market practice for a long time. Thanks! - Egemen Candir, Jul 05 '22 at 11:52
