
I have a DataFrame with almost 14 million rows. I am working with financial options data and ideally I need an interest rate (called the risk-free rate) for each option according to its time to maturity. According to the literature I'm following, one way to do this is to get US Treasury Bond interest rates and, for each option, check which Treasury Bond rate's maturity is closest (in absolute terms) to the option's time to maturity. To achieve this I created a loop that fills a DataFrame with those differences. My code is far from elegant and it is a bit messy, because there are combinations of dates and maturities for which there are no rates; hence the conditionals inside the loop. After the loop is done I can look for the maturity with the lowest absolute difference and choose the rate for that maturity. The script was taking so long to run that I added tqdm to get some kind of feedback on what is happening.

I tried running the code. It would take days to complete, and it slows down as the iterations increase (I know this from tqdm). At first I was adding rows to the differences DataFrame with DataFrame.loc, but since I thought that was the reason the code was slowing down over time, I switched to DataFrame.append. The code is still slow and still slowing down over time.

I searched for a way to increase performance and found this question: How to speed up python loop. Someone suggests using Cython, but honestly I still consider myself a beginner in Python, so from looking at the examples it doesn't seem trivial to do. Is that my best option? If it takes a lot of time to learn, then I could also do what others do in the literature and just use the 3-month interest rate for all options, but I would prefer not to go there. Maybe there are other (easy) answers to my problem; please let me know. I include a reproducible code example (although with only 2 rows of data):

from tqdm import tqdm
import pandas as pd


# Treasury maturities, in years
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]

# Useful lists
treasury_maturities1 = [3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities2 = [1/12]
treasury_maturities3 = [6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities4 = [1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities5 = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20]

# Dataframe that will contain the difference between the time to maturity of option and the different maturities
differences = pd.DataFrame(columns = treasury_maturities)


# Options Dataframe sample
options_list = [[pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "c",    309.1, 311.1, 1108.49, 1108.49, 0.0410958904109589, 310.1], [pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "p", 0.0, 0.05, 1108.49, 1108.49, 0.0410958904109589, 0.025]]

options = pd.DataFrame(options_list, columns = ['QuoteDate', 'expiration', 'strike', 'OptionType', 'bid_eod', 'ask_eod', 'underlying_bid_eod', 'underlying_ask_eod', 'Time_to_Maturity', 'Option_Average_Price'])


# Loop
for index, row in tqdm(options.iterrows()):
    if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2018-10-15"):
        if (pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2006-02-08")
                and row.Time_to_Maturity > 25):
            list_s = [abs(maturity - row.Time_to_Maturity)
                      for maturity in treasury_maturities5]
            list_s = [list_s + [40]]  # 40 is an arbitrary number bigger than 30
            differences = differences.append(
                pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
        elif ((pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate
                and 1.5/12 <= row.Time_to_Maturity <= 3.5/12):
            list_s = [0, 40, 40]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                for maturity in treasury_maturities3]]
            differences = differences.append(
                pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
        elif ((pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate
                and 3.5/12 < row.Time_to_Maturity <= 4.5/12):
            list_s = [abs(maturity - row.Time_to_Maturity)
                      for maturity in treasury_maturities2]
            list_s = list_s + [40, 40, 0]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                for maturity in treasury_maturities4]]
            differences = differences.append(
                pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
        else:
            if 1.5/12 <= row.Time_to_Maturity <= 2/12:
                list_s = [0, 40]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                    for maturity in treasury_maturities1]]
                differences = differences.append(
                    pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
            elif 2/12 < row.Time_to_Maturity <= 2.5/12:
                list_s = [abs(maturity - row.Time_to_Maturity)
                          for maturity in treasury_maturities2]
                list_s = list_s + [40, 0]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                    for maturity in treasury_maturities3]]
                differences = differences.append(
                    pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
            else:
                list_s = [[abs(maturity - row.Time_to_Maturity)
                           for maturity in treasury_maturities]]
                differences = differences.append(
                    pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
    else:
        list_s = [[abs(maturity - row.Time_to_Maturity)
                   for maturity in treasury_maturities]]
        differences = differences.append(
            pd.DataFrame(list_s, columns=treasury_maturities), ignore_index=True)
Wasonic
  • Hi Wasonic and welcome. It would be helpful for us if you could simplify the code provided, perhaps identifying the main problem you are facing. In any case, I'll try to check this later at home. – Lith Mar 22 '21 at 14:34
  • I'd recommend you start by putting some print statements in to get a feel for where the script takes longest to execute - from experience pd.append is very very slow, so my suspicion is that it is the bottleneck – Jimmy Mar 22 '21 at 16:45
  • @Lith Please tell me if you have a question about my code. My problem is the performance of the loop. I need it to not take days, and I'm also curious why it is slowing down over time (I think it has to do with df.append). – Wasonic Mar 24 '21 at 07:53
  • @Jimmy Could you tell me where you would put such print statements? – Wasonic Mar 24 '21 at 07:56
  • I mean, basically the only operations you do are list comprehensions and append calls, so yes, the `df.append` will probably be the main reason why your code is slow. I'd like to help you more, but I can't find a quick way to do it. – Lith Mar 24 '21 at 19:03

3 Answers


Short Answer

Loops and if statements are both computationally expensive operations, so look for ways to reduce the number you use.

Loop Optimization: The best way to speed up a loop is to move as much computation as possible out of it (see the sketch below).

DRY: Don't Repeat Yourself. You have several redundant if conditions; look into nested if conditions and follow the DRY principle.
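
For instance, the fixed dates in the question's loop can be converted once, before the loop, instead of on every row. A minimal sketch (the names are illustrative; note that the original `(a or b or c) == x` comparison only ever tests `x` against the first date, because `or` returns its first truthy operand, so a membership test is also a correctness fix):

import pandas as pd

# Convert the fixed dates once, outside the loop, instead of on every row
START = pd.to_datetime("2004-01-02")
END = pd.to_datetime("2018-10-15")
SPECIAL_DATES = {pd.to_datetime(d)
                 for d in ("2008-12-10", "2008-12-18", "2008-12-24")}

for index, row in options.iterrows():
    if START <= row.QuoteDate <= END:
        if row.QuoteDate in SPECIAL_DATES:
            pass  # date-specific logic goes here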

Using pandas and numpy

One of the main benefits of libraries such as pandas and numpy is that they are designed for efficiency in mathematical operations on arrays (see Why are numpy arrays so fast?). This means you usually do not have to use loops at all. Instead of creating a new DataFrame inside your loop, create a new column for each value you are computing.
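
As a minimal sketch of what that means for the nearest-maturity problem (assuming the `options` DataFrame from the question; the `Closest_Maturity` column name is illustrative), all the differences can be computed without a Python-level loop:

import numpy as np

# Treasury maturities, in years (from the question)
maturities = np.array([1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30])
ttm = options['Time_to_Maturity'].to_numpy()

# Broadcasting yields an (n_options, n_maturities) array of absolute
# differences in one vectorized step
diffs = np.abs(ttm[:, None] - maturities[None, :])
options['Closest_Maturity'] = maturities[diffs.argmin(axis=1)]

The intermediate array here is about 14 million x 12 float64 values (roughly 1.3 GB), so for the full dataset you may still want to process it in chunks.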

To handle different logic for different dates etc., filter rows and then apply the logic: use a mask/filter to select only the rows you need to operate on instead of using if statements (see the pandas filtering tutorial).

Code Example

This code is not a replication of your logic, but an example of how it could be implemented. It's not perfect, but should provide some major efficiency improvements.

import pandas as pd
import numpy as np

# Maturity periods, months and years
month_periods = np.array([1, 2, 3, 6, ], dtype=np.float64)
year_periods = np.array([1, 2, 3, 4, 5, 7, 10, 20, 30, ], dtype=np.float64)

# Create column names for maturities
maturity_cols = [f"month_{m:02.0f}" for m in month_periods] + [f"year_{y:02.0f}" for y in year_periods]

# Normalise months  & concatenate into single array
month_periods = month_periods / 12
maturities = np.concatenate((month_periods, year_periods))

# Create some dummy data
np.random.seed(seed=42)  # Seed PRN generator
n_records = 100_000  # Number of dummy rows to generate; the question's dataset has ~14 million
date_range = pd.date_range(start="2004-01-01", end="2021-01-30", freq='D')  # Dates to sample from
dates = np.random.choice(date_range, size=n_records, replace=True)
maturity_times = np.random.random(size=n_records)
options = pd.DataFrame(list(zip(dates, maturity_times)), columns=['QuoteDate', 'Time_to_Maturity', ])

# Create date masks
after = options['QuoteDate'] >= pd.to_datetime("2008-01-01")
before = options['QuoteDate'] <= pd.to_datetime("2015-01-01")

# Combine date masks / create flipped version
between = after & before
outside = np.logical_not(between)

# Select data with masks
df_outside = options[outside].copy()
df_between = options[between].copy()

# Smaller dataframes
df_a = df_between[df_between['Time_to_Maturity'] > 25].copy()
df_b = df_between[df_between['Time_to_Maturity'] <= 3.5 / 12].copy()
df_c = df_between[df_between['Time_to_Maturity'] <= 4.5 / 12].copy()
df_d = df_between[
    (df_between['Time_to_Maturity'] >= 2 / 12) & (df_between['Time_to_Maturity'] <= 4.5 / 12)].copy()

# For each maturity period, add difference column using different formula
for i, col in enumerate(maturity_cols):
    # Add a line here for each subset / chunk of data which requires a different formula.
    # Each subset must use its own Time_to_Maturity column so the indexes align;
    # otherwise the assignment produces NaN for every row.
    df_a[col] = ((maturities[i] - df_a['Time_to_Maturity']) + 40).abs()
    df_b[col] = ((maturities[i] - df_b['Time_to_Maturity']) / 2).abs()
    df_c[col] = (maturities[i] - df_c['Time_to_Maturity'] + 1).abs()
    df_d[col] = (maturities[i] - df_d['Time_to_Maturity'] * 0.8).abs()
    df_outside[col] = (maturities[i] - df_outside['Time_to_Maturity']).abs()

# Concatenate dataframes back to one dataset
frames = [df_outside, df_a, df_b, df_c, df_d, ]
output = pd.concat(frames).dropna(how='any')

output.head()

Average execution time for number of records. Even millions of records are processed quickly (memory allowing):

| Records | Old Time (secs) | New Time (secs) | Improvement |
|-|-|-|-|
| 10 | 0.0105 | 0.0244 | -132.38% |
| 100 | 0.1078 | 0.0249 | 76.90% |
| 1,000 (1k) | 1.03 | 0.0249 | 97.58% |
| 10,000 (10k) | 15.629 | 0.0322 | 99.79% |
| 100,000 (100k) | 182.014 | 0.065 | 99.96% |
| 1,000,000 (1m) | ? | 0.4014 | ? |
| 10,000,000 (10m) | ? | 4.7488 | ? |
| 14,000,000 (14m) | ? | 6.0172 | ? |
| 100,000,000 (100m) | ? | 83.286 | ? |

Further Optimizations

Once you have optimized and profiled your basic code you can also look into multithreading, parallelising your code, or using a different language. Also, 14 million records will eat up a lot of RAM - much more than most workstations can handle. To get around this limitation you can read the file in chunks and perform your calculations on one chunk at a time:

result_frames = []
# Read the file in 10,000-row chunks so the whole dataset never has to
# fit in memory at once
for chunk in pd.read_csv("voters.csv", chunksize=10000):
    # Do things here
    result = chunk
    result_frames.append(result)

Google search terms: multiprocessing / threading / Dask / PySpark
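
For example, a minimal Dask sketch (assuming dask is installed; the file name and the derived column are placeholders):

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, so the full 14M rows never
# have to fit in memory at the same time
ddf = dd.read_csv("options.csv")
ddf["diff_3m"] = (ddf["Time_to_Maturity"] - 3 / 12).abs()
result = ddf.compute()  # triggers the actual computation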

Doug

For your problem, "divide and conquer" can guide you to the solution. I suggest splitting your code into chunks and analyzing each part, because I see some redundancy like this:

(pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24"))

It seems that the casting from string to datetime is done on every row. You should profile your code using profile or a more specific tool like perf_tool [*]. It helps by putting sentinels in the code and reporting all intermediate times, numbers of calls, and means.

[*] I'm the main developer.
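
For example, a minimal sketch using the standard library's cProfile (this assumes the loop has been wrapped in a function called main):

import cProfile
import pstats

# Run the loop under the profiler and dump the stats to a file
cProfile.run("main()", "loop_profile.out")

# Print the 10 entries with the highest cumulative time
stats = pstats.Stats("loop_profile.out")
stats.sort_stats("cumulative").print_stats(10)

Alternatively, `python -m cProfile -s cumtime your_script.py` profiles the whole script without any code changes.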

Glauco
  • Well one thing I could do is define those pd.to_datetime() as variables before the loop. Something like: dt1 = pd.to_datetime("2008-12-10"), dt2 = pd.to_datetime("2008-12-18"), etc. – Wasonic Mar 24 '21 at 08:06
  • I also took a look at profile. What am I supposed to do with that exactly? Sorry for the double comment, I'm new here. – Wasonic Mar 24 '21 at 08:12
  • profiler will guide you in looking for bottlenecks, most called functions, and in general some metrics on your algorithm – Glauco Mar 24 '21 at 08:32

As others have already pointed out, please profile your code to find the slowest parts.

Some possible speedups:

Consider using generators instead of lists whenever possible. Also, perhaps using list.extend may be faster than list concatenation.

list_s = [abs(maturity - row.Time_to_Maturity) for maturity in
          treasury_maturities2]

can be

list_s = (abs(maturity - row.Time_to_Maturity) for maturity in
          treasury_maturities2)

And

list_s = list_s + [foo, bar, baz]

can be

list_s.extend([foo, bar, baz])  # extend modifies the list in place and returns None, so don't reassign
MAK