I have a DataFrame with almost 14 million rows. I am working with financial options data and ideally I need an interest rate (called risk-free rate) for each option according to it’s time to maturity. According to the literature I’m following, one way to do this is to get US Treasury Bonds interest rates and, for each option, check what is the Treasury Bond rate whose maturity is closest to the time to maturity of the option (in absolute terms). To achieve this I created a loop that will fill a Dataframe with those differences. My code is far from elegant and it is a bit messy because there are combinations of dates and maturities for which there are no rates. Hence the conditionals inside the loop. After the loop is done I can look at what is the maturity with the lowest absolute difference and choose the rate for that maturity. The script was taking so long to run that I added tqdm to have some kind of feedback of what is happening.
I tried running the code. It will take days to complete and it is slowing down as the iterations increase (I know this from tqdm). At first I was adding rows to the differences DataFrame using DataFrame.loc. But as I thought that was the reason the code was slowing down over time, I switched to DataFrame.append. The code is still slow and slowing down over time.
I searched for a way to increase performance and found this question: How to speed up python loop. Someone suggests using Cython but honestly I still consider myself a beginner to Python so from looking at the examples it doesn’t seem something trivial to do. Is that my best option? If it takes a lot of time to learn than I can also do what others do in the literature and just use the 3-month interest rate for all options. But I would prefer not to go there there. Maybe there are other (easy) answers to my problem, please let me know. I include a reproducible code example (although with only 2 rows of data):
from tqdm import tqdm
import pandas as pd
# Treasury maturities, in years
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]
# Useful lists
treasury_maturities1 = [3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities2 = [1/12]
treasury_maturities3 = [6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities4 = [1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities5 = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20]
# Dataframe that will contain the difference between the time to maturity of option and the different maturities
differences = pd.DataFrame(columns = treasury_maturities)
# Options Dataframe sample
options_list = [[pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "c", 309.1, 311.1, 1108.49, 1108.49, 0.0410958904109589, 310.1], [pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "p", 0.0, 0.05, 1108.49, 1108.49, 0.0410958904109589, 0.025]]
options = pd.DataFrame(options_list, columns = ['QuoteDate', 'expiration', 'strike', 'OptionType', 'bid_eod', 'ask_eod', 'underlying_bid_eod', 'underlying_ask_eod', 'Time_to_Maturity', 'Option_Average_Price'])
# Loop
for index, row in tqdm(options.iterrows()):
if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2018-10-15"):
if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2006-02-08") and row.Time_to_Maturity > 25:
list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities5])
list_s = [list_s + [40]] # 40 is an arbitrary number bigger than 30
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)
elif (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate and 1.5/12 <= row.Time_to_Maturity <= 3.5/12:
list_s = [0, 40, 40]
list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for
maturity in treasury_maturities3]]
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)
elif (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate and 3.5/12 < row.Time_to_Maturity <= 4.5/12:
list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities2])
list_s = list_s + [40, 40, 0]
list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for
maturity in treasury_maturities4]]
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)
else:
if 1.5/12 <= row.Time_to_Maturity <= 2/12:
list_s = [0, 40]
list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities1]]
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)
elif 2/12 < row.Time_to_Maturity <= 2.5/12:
list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities2])
list_s = list_s + [40, 0]
list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities3]]
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)
else:
list_s = [[abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities]]
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)
else:
list_s = [[abs(maturity - row.Time_to_Maturity) for maturity in
treasury_maturities]]
differences = differences.append(pd.DataFrame(list_s,
columns = treasury_maturities), ignore_index = True)